1 Introduction

This R Markdown script, together with the accompanying text data files and the rendered HTML file, contains all code used for computing the complexity metrics, analysing the results, and producing the figures and tables.

2 Linguistic complexity across languages: a short overview

No Supplementary Information is provided for Section 2.

3 Corpus description and subsampling strategy

A dataset of 47 typologically and geographically diverse languages is used in this study. The table below summarizes their morphological strategies, namely Fusion (derived from WALS Chapter 20), Exponence (WALS Chapter 21B), and the degree of Verbal Inflection (WALS Chapter 22), as well as their phonological complexity, estimated from Syllable Complexity (WALS Chapter 12) and Tonal System (WALS Chapter 13, except for Daga, taken from Maddieson et al. 2013, and Barasano, from Gomez-Imbert & Kenstowicz 2000).

Language description

| Language | Glottocode | Exponence | Strategy | Fusion | VerbInflection | SyllableStructure | TonalSystem | Macroarea |
|---|---|---|---|---|---|---|---|---|
| Alamblak | alam1246 | MonoV | Morph | Concatenative | Mid High | Complex | No tones | Papunesia |
| Amele | amel1241 | MonoV | Balanced | Concatenative | Mid High | Moderately complex | No tones | Papunesia |
| Apurinã | apur1254 | MonoV | Balanced | Concatenative | Mid High | Simple | No tones | South America |
| Arabic (Egyptian) | egyp1253 | Poly | Morph | Other | Mid High | Complex | No tones | Africa |
| Arapesh (Mountain) | buki1249 | MonoV | Balanced | Concatenative | Mid High | Moderately complex | No tones | Papunesia |
| Barasano | bara1380 | MonoVN | Balanced | Concatenative | Mid High | Simple | Simple tone system | South America |
| Basque | basq1248 | MonoVN | Morph | Concatenative | Mid High | Complex | No tones | Eurasia |
| Burmese | nucl1310 | MonoVN | Lexical | Concatenative | Low | Moderately complex | Complex tone system | Eurasia |
| Chamorro | cham1312 | Poly | Balanced | Other | Mid High | Moderately complex | No tones | Papunesia |
| Daga | daga1275 | Poly | Morph | Concatenative | Mid High | Moderately complex | No tones | Papunesia |
| English | stan1293 | MonoV | Lexical | Concatenative | Low | Complex | No tones | Eurasia |
| Fijian | fiji1243 | MonoV | Lexical | Isolating | Mid High | Simple | No tones | Papunesia |
| Finnish | finn1318 | Poly | Balanced | Concatenative | Low | Moderately complex | No tones | Eurasia |
| French | stan1290 | Poly | Balanced | Concatenative | Mid High | Complex | No tones | Eurasia |
| Georgian | nucl1302 | Poly | Morph | Concatenative | Mid High | Complex | No tones | Eurasia |
| German | stan1295 | Poly | Balanced | Concatenative | Low | Complex | No tones | Eurasia |
| Greek (Modern) | mode1248 | Poly | Morph | Concatenative | Mid High | Complex | No tones | Eurasia |
| Greenlandic (West) | kala1399 | Poly | Morph | Concatenative | Mid High | Moderately complex | No tones | Eurasia |
| Guaraní | para1311 | MonoV | Morph | Concatenative | Mid High | Simple | No tones | South America |
| Hausa | haus1257 | MonoV | Lexical | Isolating | Mid High | Moderately complex | Simple tone system | Africa |
| Hindi | hind1269 | Poly | Morph | Concatenative | Low | Complex | No tones | Eurasia |
| Indonesian | indo1316 | MonoV | Lexical | Isolating | Mid High | Complex | No tones | Papunesia |
| Jakaltek | popt1235 | MonoV | Balanced | Concatenative | Mid High | Moderately complex | No tones | North America |
| Kewa | west2599 | Poly | Morph | Concatenative | Mid High | Simple | Simple tone system | Papunesia |
| Khalkha | halh1238 | MonoVN | Balanced | Concatenative | Low | Complex | No tones | Eurasia |
| Khoekhoe | nama1264 | MonoVN | Balanced | Other | Mid High | Moderately complex | Complex tone system | Africa |
| Korean | kore1280 | MonoVN | Balanced | Concatenative | Mid High | Moderately complex | No tones | Eurasia |
| Lango | lang1324 | Poly | Balanced | Other | Mid High | Moderately complex | Simple tone system | Africa |
| Malagasy | plat1254 | MonoVN | Lexical | Concatenative | Mid High | Moderately complex | No tones | Africa |
| Mapudungun | mapu1245 | Poly | Balanced | Concatenative | Mid High | Moderately complex | No tones | South America |
| Mixtec (Chalcatongo) | sanm1295 | MonoV | Balanced | Concatenative | Mid High | Simple | Complex tone system | North America |
| Oromo (Harar) | east2652 | MonoVN | Balanced | Concatenative | Mid High | Moderately complex | Complex tone system | Africa |
| Persian | west2369 | MonoVN | Morph | Concatenative | Mid High | Complex | No tones | Eurasia |
| Quechua (Imbabura) | imba1240 | MonoVN | Morph | Concatenative | Mid High | Moderately complex | No tones | South America |
| Russian | russ1263 | Poly | Balanced | Concatenative | Mid High | Complex | No tones | Eurasia |
| Sango | sang1328 | None | Lexical | Concatenative | Low | Simple | Simple tone system | Africa |
| Sanumá | sanu1240 | MonoVN | Lexical | Concatenative | Mid High | Moderately complex | No tones | South America |
| Spanish | stan1288 | Poly | Morph | Concatenative | Mid High | Moderately complex | No tones | Eurasia |
| Swahili | swah1253 | MonoV | Morph | Concatenative | Mid High | Simple | No tones | Africa |
| Tagalog | taga1270 | Poly | Balanced | Concatenative | Low | Moderately complex | No tones | Papunesia |
| Thai | thai1261 | MonoV | Lexical | Other | Low | Moderately complex | Complex tone system | Eurasia |
| Turkish | nucl1301 | MonoVN | Morph | Concatenative | Mid High | Moderately complex | No tones | Eurasia |
| Vietnamese | viet1252 | MonoV | Lexical | Isolating | Low | Moderately complex | Complex tone system | Eurasia |
| Wichí | wich1264 | MonoV | Balanced | Isolating | Low | Moderately complex | No tones | South America |
| Yagua | yagu1244 | MonoV | Morph | Concatenative | Mid High | Moderately complex | Simple tone system | South America |
| Yaqui | yaqu1251 | Poly | Balanced | Concatenative | Mid High | Moderately complex | Simple tone system | North America |
| Yoruba | yoru1245 | MonoVN | Lexical | Other | Mid High | Simple | Complex tone system | Africa |
Figure 1. a) Geographical distribution. b) Distribution of the languages across classical WALS typological features and symbolic codes. Marker color and shape respectively encode the fusion strategy and the exponence category. Marker size further indicates whether verbal inflection is limited (small markers for Low values) or more extensive (large markers for Mid and High values). In each cell, the number of languages is displayed when different from zero.


The languages belonging to each category of Figure 1b are displayed in the table below.

List of languages

| Exponence | Isolating-Low | Isolating-Mid/High | Concatenative-Low | Concatenative-Mid/High | Other-Low | Other-Mid/High |
|---|---|---|---|---|---|---|
| None | NA | NA | Sango | NA | NA | NA |
| MonoV | Vietnamese, Wichí | Fijian, Hausa, Indonesian | English | Alamblak, Amele, Apurinã, Arapesh (Mountain), Guaraní, Jakaltek, Mixtec (Chalcatongo), Swahili, Yagua | Thai | NA |
| MonoVN | NA | NA | Burmese, Khalkha | Barasano, Basque, Korean, Malagasy, Oromo (Harar), Persian, Quechua (Imbabura), Sanumá, Turkish | NA | Khoekhoe, Yoruba |
| Poly | NA | NA | Finnish, German, Hindi, Tagalog | Daga, French, Georgian, Greek (Modern), Greenlandic (West), Kewa, Mapudungun, Russian, Spanish, Yaqui | NA | Arabic (Egyptian), Chamorro, Lango |

4 Morphological complexity

4.1 Grammar-based morphological complexity indices

Two metrics of Morphological Complexity are computed following a top-down typological approach, based on two linguistic databases: Grammar-based Morphological Complexity derived from WALS (GMC_W) and Grammar-based Morphological Complexity derived from AUTOTYP (GMC_A).

4.1.1 Grammar-based morphological complexity derived from WALS (GMC_W)

The GMC_W score is calculated by distinguishing between lexical and inflectional coding strategies and summing the values assigned to the linguistic features (-1 for lexical and 0 for morphological strategies), which are represented by categorical or continuous variables. Continuous variables, such as the number of case categories (WALS feature 49A) and the number of grammatical categories expressed by the inflectional synthesis of the verb (WALS feature 22A), are normalized between -1 and 0 to better represent the degree of morphological complexity. The score is obtained by dividing the overall sum by the number of features available for each language. The 29 morphological features derived from WALS are detailed in the code below.
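As a toy illustration of this aggregation (with hypothetical feature scores, not taken from WALS): a language with three documented features and one missing feature receives the sum of its available scores divided by the number of available features.

```r
# Toy illustration of the GMC_W aggregation (hypothetical scores):
# -1 = lexical strategy, 0 = morphological strategy, intermediate values
# come from normalized continuous features; NA = feature not documented.
scores <- c(-1, -0.5, 0, NA)
gmc_w <- sum(scores, na.rm = TRUE) / sum(!is.na(scores))
gmc_w  # -1.5 divided by the 3 available features, i.e. -0.5
```

The same division by the number of non-missing features is performed for the full 29-feature set at the end of this section.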

# Required packages (assumed loaded at this point in the script):
# readr provides read_delim()/read_csv(); dplyr provides the pipe-based summaries used below
library(readr)
library(dplyr)

# Load a list of languages
languageList <- data[,c("Language")]

# Load information for each parameter (29 in total) obtained from WALS (Downloaded online from https://wals.info in January, 2022)

# 1) 20A: Fusion of Selected Inflectional Formatives
X20A <- read_delim("./20A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- X20A[X20A$Language %in% languageList$Language,]
allList$S20A <- 0
allList[!is.na(allList$`20A`) & allList$`20A`=="Exclusively tonal",]$S20A <- -1
allList[!is.na(allList$`20A`) & allList$`20A`=="Exclusively isolating",]$S20A <- -1
allList[!is.na(allList$`20A`) & allList$`20A`=="Isolating/concatenative",]$S20A <- -0.5
allList[!is.na(allList$`20A`) & allList$`20A`=="Tonal/isolating",]$S20A <- -0.5
allList[!is.na(allList$`20A`) & allList$`20A`=="Ablaut/concatenative",]$S20A <- 0
allList[!is.na(allList$`20A`) & allList$`20A`=="Exclusively concatenative",]$S20A <- 0
remove(X20A)
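The repeated conditional assignments above (and in the analogous blocks for the other 28 features) can be condensed with a named lookup vector; a sketch with the same 20A values, where categories absent from the lookup or coded NA keep the default score of 0, matching the initialization `allList$S20A <- 0`:

```r
# Sketch: the 20A scoring via a named lookup vector (same values as above)
score20A <- c("Exclusively tonal"         = -1,
              "Exclusively isolating"     = -1,
              "Isolating/concatenative"   = -0.5,
              "Tonal/isolating"           = -0.5,
              "Ablaut/concatenative"      = 0,
              "Exclusively concatenative" = 0)

# Toy vector standing in for allList$`20A`
x <- c("Exclusively tonal", "Isolating/concatenative", NA)
s <- unname(score20A[x])  # look up each category; NA and unknown labels give NA
s[is.na(s)] <- 0          # default of 0, as in the explicit assignments above
s  # -1.0 -0.5 0.0
```

The explicit per-value form used in the script is more verbose but makes each WALS category label and its score directly visible, which may be preferable for a supplementary document.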

# 2) 26A: Prefixing vs. Suffixing in Inflectional Morphology
X26A <- read_delim("./26A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X26A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S26A <- 0
allList[!is.na(allList$`26A`) & allList$`26A`=="Strongly suffixing",]$S26A <- 0
allList[!is.na(allList$`26A`) & allList$`26A`=="Equal prefixing and suffixing",]$S26A <- 0 
allList[!is.na(allList$`26A`) & allList$`26A`=="Weakly suffixing",]$S26A <- 0
allList[!is.na(allList$`26A`) & allList$`26A`=="Weakly prefixing",]$S26A <- 0
allList[!is.na(allList$`26A`) & allList$`26A`=="Little affixation",]$S26A <- -1
allList[!is.na(allList$`26A`) & allList$`26A`=="Strongly prefixing",]$S26A <- 0
remove(X26A)
 
# 3) 49A: Number of Cases
X49A <- read_delim("./49A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X49A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S49A <- 0
allList[!is.na(allList$`49A`) & allList$`49A`=="No morphological case-marking",]$S49A <- -1
allList[!is.na(allList$`49A`) & allList$`49A`=="Exclusively borderline case-marking",]$S49A <- -0.5
allList[!is.na(allList$`49A`) & allList$`49A`=="2 cases",]$S49A <- -0.85
allList[!is.na(allList$`49A`) & allList$`49A`=="3 cases",]$S49A <- -0.7
allList[!is.na(allList$`49A`) & allList$`49A`=="4 cases",]$S49A <- -0.55
allList[!is.na(allList$`49A`) & allList$`49A`=="6-7 cases",]$S49A <- -0.4
allList[!is.na(allList$`49A`) & allList$`49A`=="8-9 cases",]$S49A <- -0.25
allList[!is.na(allList$`49A`) & allList$`49A`=="10 or more cases",]$S49A <- 0
remove(X49A)

# 4) 28A: Case Syncretism
X28A <- read_delim("./28A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X28A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S28A <- 0
allList[!is.na(allList$`28A`) & allList$`28A`=="No case marking",]$S28A <- -1
allList[!is.na(allList$`28A`) & allList$`28A`=="No syncretism",]$S28A <- -1
allList[!is.na(allList$`28A`) & allList$`28A`=="Core cases only",]$S28A <- -0.5
allList[!is.na(allList$`28A`) & allList$`28A`=="Core and non-core",]$S28A <- 0
remove(X28A)

# 5) 98A: Alignment of Case Marking of Full Noun Phrases
X98A <- read_delim("./98A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X98A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S98A <- 0
allList[!is.na(allList$`98A`) & allList$`98A`=="Neutral",]$S98A <- -1
allList[!is.na(allList$`98A`) & allList$`98A`=="Nominative - accusative (standard)",]$S98A <- 0
allList[!is.na(allList$`98A`) & allList$`98A`=="Active-inactive",]$S98A <- 0
allList[!is.na(allList$`98A`) & allList$`98A`=="Ergative - absolutive",]$S98A <- 0
allList[!is.na(allList$`98A`) & allList$`98A`=="Nominative - accusative (marked nominative)",]$S98A <- 0
allList[!is.na(allList$`98A`) & allList$`98A`=="Tripartite",]$S98A <- 0
remove(X98A)

# 6) 22A: Inflectional Synthesis of the Verb
X22A <- read_delim("./22A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X22A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S22A <- 0
allList[!is.na(allList$`22A`) & allList$`22A`=="10-11 categories per word",]$S22A <- 0
allList[!is.na(allList$`22A`) & allList$`22A`=="8-9 categories per word",]$S22A <- -0.2
allList[!is.na(allList$`22A`) & allList$`22A`=="6-7 categories per word",]$S22A <- -0.4
allList[!is.na(allList$`22A`) & allList$`22A`=="4-5 categories per word",]$S22A <- -0.6
allList[!is.na(allList$`22A`) & allList$`22A`=="2-3 categories per word",]$S22A <- -0.8
allList[!is.na(allList$`22A`) & allList$`22A`=="0-1 category per word",]$S22A <- -1
remove(X22A)

# 7) 100A: Alignment of Verbal Person Marking
X100A <- read_delim("./100A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X100A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S100A <- 0
allList[!is.na(allList$`100A`) & allList$`100A`=="Neutral",]$S100A <- -1
remove(X100A)

# 8) 102A: Verbal Person Marking
X102A <- read_delim("./102A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X102A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S102A <- 0
allList[!is.na(allList$`102A`) & allList$`102A`=="No person marking",]$S102A <- -1
allList[!is.na(allList$`102A`) & allList$`102A`=="Only the A argument",]$S102A <- -0.5
allList[!is.na(allList$`102A`) & allList$`102A`=="Only the P argument",]$S102A <- -0.5
allList[!is.na(allList$`102A`) & allList$`102A`=="Both the A and P arguments",]$S102A <- 0
remove(X102A)

# 9) 48A: Person Marking on Adpositions
X48A <- read_delim("./48A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X48A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S48A <- 0
allList[!is.na(allList$`48A`) & allList$`48A`=="No person marking",]$S48A <- -1
allList[!is.na(allList$`48A`) & allList$`48A`=="No adpositions",]$S48A <- -1
allList[!is.na(allList$`48A`) & allList$`48A`=="Pronouns only",]$S48A <- -0.5
allList[!is.na(allList$`48A`) & allList$`48A`=="Pronouns and nouns",]$S48A <- 0
remove(X48A)

# 10) 29A: Syncretism in Verbal Person/Number Marking
X29A <- read_delim("./29A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X29A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S29A <- 0
allList[!is.na(allList$`29A`) & allList$`29A`=="No subject person/number marking",]$S29A <- -1
allList[!is.na(allList$`29A`) & allList$`29A`=="Not syncretic",]$S29A <- -1
allList[!is.na(allList$`29A`) & allList$`29A`=="Syncretic",]$S29A <- 0
remove(X29A)

# 11) 74A: Situational Possibility
X74A <- read_delim("./74A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X74A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S74A <- 0
allList[!is.na(allList$`74A`) & allList$`74A`=="Verbal constructions",]$S74A <- -1
allList[!is.na(allList$`74A`) & allList$`74A`=="Other kinds of markers",]$S74A <- -1
allList[!is.na(allList$`74A`) & allList$`74A`=="Affixes on verbs",]$S74A <- 0
remove(X74A)

# 12) 75A: Epistemic Possibility
X75A <- read_delim("./75A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X75A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S75A <- 0
allList[!is.na(allList$`75A`) & allList$`75A`=="Verbal constructions",]$S75A <- -1
allList[!is.na(allList$`75A`) & allList$`75A`=="Other",]$S75A <- -1
allList[!is.na(allList$`75A`) & allList$`75A`=="Affixes on verbs",]$S75A <- 0
remove(X75A)

# 13) 76A: Overlap between Situational and Epistemic Modal Marking
X76A <- read_delim("./76A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X76A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S76A <- 0
allList[!is.na(allList$`76A`) & allList$`76A`=="No overlap",]$S76A <- -1
allList[!is.na(allList$`76A`) & allList$`76A`=="Overlap for either possibility or necessity",]$S76A <- -0.5
allList[!is.na(allList$`76A`) & allList$`76A`=="Overlap for both possibility and necessity",]$S76A <- 0
remove(X76A)

# 14) 77A: Semantic Distinctions of Evidentiality
X77A <- read_delim("./77A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X77A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S77A <- 0
allList[!is.na(allList$`77A`) & allList$`77A`=="No grammatical evidentials",]$S77A <- -1
allList[!is.na(allList$`77A`) & allList$`77A`=="Indirect only",]$S77A <- -0.5
allList[!is.na(allList$`77A`) & allList$`77A`=="Direct and indirect",]$S77A <- 0
remove(X77A)

# 15) 112A: Negative Morphemes
X112A <- read_delim("./112A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X112A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S112A <- 0
allList[!is.na(allList$`112A`) & allList$`112A`=="Negative particle",]$S112A <- 0
allList[!is.na(allList$`112A`) & allList$`112A`=="Negative auxiliary verb",]$S112A <- -1
allList[!is.na(allList$`112A`) & allList$`112A`=="Negative word, unclear if verb or particle",]$S112A <- -1
allList[!is.na(allList$`112A`) & allList$`112A`=="Double negation",]$S112A <- 0
allList[!is.na(allList$`112A`) & allList$`112A`=="Negative affix",]$S112A <- 0
remove(X112A)

# 16) 34A: Occurrence of Nominal Plurality
X34A <- read_delim("./34A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X34A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S34A <- 0
allList[!is.na(allList$`34A`) & allList$`34A`=="No nominal plural",]$S34A <- -1
allList[!is.na(allList$`34A`) & allList$`34A`=="All nouns, always optional",]$S34A <- -0.5
allList[!is.na(allList$`34A`) & allList$`34A`=="Only human nouns, optional",]$S34A <- -0.5
allList[!is.na(allList$`34A`) & allList$`34A`=="Only human nouns, obligatory",]$S34A <- -0.5
allList[!is.na(allList$`34A`) & allList$`34A`=="All nouns, always obligatory",]$S34A <- 0
remove(X34A)

# 17) 36A: The Associative Plural
X36A <- read_delim("./36A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X36A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S36A <- 0
allList[!is.na(allList$`36A`) & allList$`36A`=="No associative plural",]$S36A <- -1
allList[!is.na(allList$`36A`) & allList$`36A`=="Unique periphrastic associative plural",]$S36A <- 0
allList[!is.na(allList$`36A`) & allList$`36A`=="Unique affixal associative plural",]$S36A <- 0
allList[!is.na(allList$`36A`) & allList$`36A`=="Associative same as additive plural",]$S36A <- 0
remove(X36A)

# 18) 92A: Position of Polar Question Particles
X92A <- read_delim("./92A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X92A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S92A <- 0
allList[!is.na(allList$`92A`) & allList$`92A`=="No question particle",]$S92A <- -1
allList[!is.na(allList$`92A`) & allList$`92A`=="Final",]$S92A <- 0
allList[!is.na(allList$`92A`) & allList$`92A`=="Initial",]$S92A <- 0
allList[!is.na(allList$`92A`) & allList$`92A`=="Other position",]$S92A <- 0
allList[!is.na(allList$`92A`) & allList$`92A`=="Second position",]$S92A <- 0
allList[!is.na(allList$`92A`) & allList$`92A`=="In either of two positions",]$S92A <- 0
remove(X92A)

# 19) 67A: The Future Tense
X67A <- read_delim("./67A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X67A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S67A <- 0
allList[!is.na(allList$`67A`) & allList$`67A`=="No inflectional future",]$S67A <- -1
allList[!is.na(allList$`67A`) & allList$`67A`=="Inflectional future exists",]$S67A <- 0
remove(X67A)

# 20) 66A: The Past Tense
X66A <- read_delim("./66A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X66A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S66A <- 0
allList[!is.na(allList$`66A`) & allList$`66A`=="Present, 4 or more remoteness distinctions",]$S66A <- 0
allList[!is.na(allList$`66A`) & allList$`66A`=="Present, 2-3 remoteness distinctions",]$S66A <- -0.33
allList[!is.na(allList$`66A`) & allList$`66A`=="Present, no remoteness distinctions",]$S66A <- -0.67
allList[!is.na(allList$`66A`) & allList$`66A`=="No past tense",]$S66A <- -1
remove(X66A)

# 21) 65A: Perfective/Imperfective Aspect
X65A <- read_delim("./65A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X65A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S65A <- 0
allList[!is.na(allList$`65A`) & allList$`65A`=="No grammatical marking",]$S65A <- -1
allList[!is.na(allList$`65A`) & allList$`65A`=="Grammatical marking",]$S65A <- 0
remove(X65A)

# 22) 70A: The Morphological Imperative
X70A <- read_delim("./70A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X70A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S70A <- 0
allList[!is.na(allList$`70A`) & allList$`70A`=="No second-person imperatives",]$S70A <- -1
allList[!is.na(allList$`70A`) & allList$`70A`=="Second singular and second plural",]$S70A <- 0
allList[!is.na(allList$`70A`) & allList$`70A`=="Second plural",]$S70A <- 0
allList[!is.na(allList$`70A`) & allList$`70A`=="Second person number-neutral",]$S70A <- 0
allList[!is.na(allList$`70A`) & allList$`70A`=="Second singular",]$S70A <- 0
remove(X70A)

# 23) 57A: Position of Pronominal Possessive Affixes
X57A <- read_delim("./57A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X57A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S57A <- 0
allList[!is.na(allList$`57A`) & allList$`57A`=="No possessive affixes",]$S57A <- -1
allList[!is.na(allList$`57A`) & allList$`57A`=="Possessive suffixes",]$S57A <- 0
allList[!is.na(allList$`57A`) & allList$`57A`=="Possessive prefixes",]$S57A <- 0
remove(X57A)

# 24) 59A: Possessive Classification
X59A <- read_delim("./59A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X59A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S59A <- 0
allList[!is.na(allList$`59A`) & allList$`59A`=="No possessive classification",]$S59A <- -1
allList[!is.na(allList$`59A`) & allList$`59A`=="Two classes",]$S59A <- -0.67
allList[!is.na(allList$`59A`) & allList$`59A`=="Three to five classes",]$S59A <- -0.33
allList[!is.na(allList$`59A`) & allList$`59A`=="More than five classes",]$S59A <- 0
remove(X59A)

# 25) 73A: The Optative
X73A <- read_delim("./73A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X73A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S73A <- 0
allList[!is.na(allList$`73A`) & allList$`73A`=="Inflectional optative absent",]$S73A <- -1
allList[!is.na(allList$`73A`) & allList$`73A`=="Inflectional optative present",]$S73A <- 0
remove(X73A)

# 26) 37A: Definite Articles
X37A <- read_delim("./37A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X37A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S37A <- 0
allList[!is.na(allList$`37A`) & allList$`37A`=="No definite or indefinite article",]$S37A <- -1
allList[!is.na(allList$`37A`) & allList$`37A`=="No definite, but indefinite article",]$S37A <- -1
allList[!is.na(allList$`37A`) & allList$`37A`=="Definite affix",]$S37A <- 0
allList[!is.na(allList$`37A`) & allList$`37A`=="Definite word distinct from demonstrative",]$S37A <- -1
allList[!is.na(allList$`37A`) & allList$`37A`=="Demonstrative word used as definite article",]$S37A <- -1
remove(X37A)
   
# 27) 38A: Indefinite Articles
X38A <- read_delim("./38A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X38A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S38A <- 0
allList[!is.na(allList$`38A`) & allList$`38A`=="No definite or indefinite article",]$S38A <- -1
allList[!is.na(allList$`38A`) & allList$`38A`=="No indefinite, but definite article",]$S38A <- -1
allList[!is.na(allList$`38A`) & allList$`38A`=="Indefinite affix",]$S38A <- 0
allList[!is.na(allList$`38A`) & allList$`38A`=="Indefinite word distinct from 'one'",]$S38A <- -1
allList[!is.na(allList$`38A`) & allList$`38A`=="Indefinite word same as 'one'",]$S38A <- -1
remove(X38A)

# 28) 41A: Distance Contrasts in Demonstratives
X41A <- read_delim("./41A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X41A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S41A <- 0
allList[!is.na(allList$`41A`) & allList$`41A`=="No distance contrast",]$S41A <- -1
allList[!is.na(allList$`41A`) & allList$`41A`=="Two-way contrast",]$S41A <- -0.75
allList[!is.na(allList$`41A`) & allList$`41A`=="Three-way contrast",]$S41A <- -0.5
allList[!is.na(allList$`41A`) & allList$`41A`=="Four-way contrast",]$S41A <- -0.25
allList[!is.na(allList$`41A`) & allList$`41A`=="Five (or more)-way contrast",]$S41A <- 0 
remove(X41A)

# 29) 101A: Expression of Pronominal Subjects
X101A <- read_delim("./101A.txt", "\t", escape_double = FALSE, trim_ws = TRUE)
allList <- merge(X101A, allList, by="Language", all=TRUE)
allList <- allList[allList$Language %in% languageList$Language,]
allList$S101A <- 0
allList[!is.na(allList$`101A`) & allList$`101A`=="Subject pronouns in different position",]$S101A <- -1
allList[!is.na(allList$`101A`) & allList$`101A`=="Obligatory pronouns in subject position",]$S101A <- -1
allList[!is.na(allList$`101A`) & allList$`101A`=="Mixed",]$S101A <- -1
allList[!is.na(allList$`101A`) & allList$`101A`=="Subject clitics on variable host",]$S101A <- 0
allList[!is.na(allList$`101A`) & allList$`101A`=="Optional pronouns in subject position",]$S101A <- -1
allList[!is.na(allList$`101A`) & allList$`101A`=="Subject affixes on verb",]$S101A <- 0
remove(X101A)

# Sum all values
allList$sum <- apply(allList[,c("S20A","S26A","S49A","S28A","S98A","S22A","S100A","S102A","S48A","S29A","S74A","S75A","S76A","S77A","S112A","S34A","S36A","S92A","S67A","S66A","S65A","S70A","S57A","S59A","S73A","S37A","S38A","S41A","S101A")],1,sum)

# Get the number of NAs for each language and normalize the sum by the number of available features
allListFeatures <- allList[,c("Language","20A","26A","49A","28A","98A","22A","100A","102A","48A","29A","74A","75A","76A","77A","112A","34A","36A","92A","67A","66A","65A","70A","57A","59A","73A","37A","38A","41A","101A")]
allListFeatures$sumNA <- apply(is.na(allListFeatures),1,sum)
allListFeatures$totalF <- 29 - allListFeatures$sumNA
allListFeatures <- allListFeatures[,c("Language","totalF")]
allList <- merge(allList, allListFeatures, by="Language")
allList$GMC_W <- allList$sum/allList$totalF
GMC_W <- allList[,c("Language","GMC_W")]
Grammar-based Morphological Complexity based on WALS (GMC_W). On the x-axis, languages are ordered by increasing GMC_W values from left to right. GMC_W is by definition normalized between -1 and 0. Marker convention is shown in Figure 1b (also in Supplementary Information (§8.1)).


4.1.2 Grammar-based morphological complexity derived from AUTOTYP (GMC_A)

Another Grammar-based Morphological Complexity metric (GMC_A) is derived from AUTOTYP (Bickel & Nichols 2013; Bickel et al. 2017) by taking the degree of inflectional synthesis of the verb. For Malagasy and Wichí, two languages with missing information, we used the score of a closely related variety with a distinct Glottocode (Malagasy (mala1537) and Mataco (wich1263), respectively).

# Load a list of languages and merge with autotyp dataset to get a glottocode for each language
languageList <- data[,c("Language","Glottocode")]
languageList <- merge(autotyp, languageList, by="Glottocode")

# Add LID for Malagasy (mala1537) and Mataco (wich1263)
languageList <- rbind(languageList, c("mala1537","491","Malagasy"))
languageList <- rbind(languageList, c("wich1263","180","Wichí"))

# Load information downloaded from Github (Downloaded online from https://github.com/autotyp/autotyp-data/tree/0.1.0 in April, 2022)
Synthesis <- read_csv("./Synthesis.csv")
Synthesis <- unique(Synthesis[,c("LID","VInflCatAndAgrMax.n")])
Synthesis <- merge(Synthesis, languageList, by="LID")
Synthesis <- Synthesis[,c("Language","VInflCatAndAgrMax.n")]
colnames(Synthesis) <- c("Language","GMC_A")
Grammar-based verbal inflectional complexity based on AUTOTYP (GMC_A). On the x-axis, languages are ordered by increasing GMC_A values from left to right. Marker convention is shown in Figure 1b (also in Supplementary Information (§8.1)).


4.1.3 Grammar-based morphological complexities (GMC_W & GMC_A)

Language distribution in a two-dimensional space defined by Grammar-based Morphological Complexities based on WALS (GMC_W, x-axis) and AUTOTYP (GMC_A, y-axis). Marker convention is shown in Figure 1b (also in Supplementary Information (§8.1)).


4.2 Towards robust indices of morphological complexity

The following four metrics estimate morphological complexity using the Parallel Bible Corpus: Word Information Density (WID), Type-Token Ratio (TTR), Measure of Textual Lexical Diversity (MTLD), and word-level Entropy (H).
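Of these, the Type-Token Ratio is the simplest: the number of distinct word forms (types) divided by the total number of running words (tokens). A minimal sketch on a toy tokenised text:

```r
# Type-Token Ratio (TTR) on a toy tokenised text
tokens <- c("the", "dog", "saw", "the", "cat")
TTR <- length(unique(tokens)) / length(tokens)
TTR  # 4 types / 5 tokens = 0.8
```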

4.2.1 Word Information Density (WID)

Word Information Density (WID) is calculated as a pairwise ratio between the number of words in English (our reference language) and in a target language L, for a varying number of corpus subsets (the whole set, and 5, 10, 20, 40, and 60 subsets).
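The ratio itself can be sketched on hypothetical verse-aligned word counts (the counts below are illustrative, not corpus values); the reference language English obtains WID = 1 by construction:

```r
# Sketch: WID from hypothetical verse-aligned word counts
eng <- c(10, 12, 8)   # words per verse in English (reference)
tgt <- c(9, 15, 10)   # words per verse in a target language
WID <- sum(eng) / sum(tgt)
WID  # 30 / 34, approximately 0.882
```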

# Total number of subsets: 1 (1,150 verses per subset)
# Load a list of bible text file names and language codes and count the number of words in each subset in English (Parallel Bible Corpus can be downloaded from http://www.christianbentz.de/MLC2019_data.html)
listBible <- data[,c("Language","BibleFile")]
bibleENG <- listBible[listBible$Language=="English",]$BibleFile
bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
wordCount <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
colnames(wordCount)[1] <- "ENG"
wordCount$ID <- seq_along(wordCount$ENG)

# Function to calculate Word Information Density (WID), using English as a reference
 wordCountCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
  colnames(bible)[1] <- Language
  bible$ID <- seq_along(bible[,1])
  wordCount <- merge(wordCount,bible, by="ID")
  WID <- sum(wordCount$ENG)/sum(wordCount[,ncol(wordCount)])
  langWID <- c(Language, WID)
  return(langWID)
 }

# Calculate WID by using the list of bible text file names and language codes
 listBible <- listBible[!listBible$Language=="English",]
 widList <- list()
 widLanList <- list()
 widLists1 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- wordCountCal(bible, Language)  # call once per language: each call re-reads the file
  widList[[i]] <- res[2]
  widLanList[[i]] <- res[1]
  widLists1[[i]] <- cbind(unlist(widLanList[[i]]), unlist(widList[[i]]))
}
widLists1 <- as.data.frame(do.call(rbind, widLists1))
colnames(widLists1) <- c("Language","WID")
widLists1 <- widLists1[order(widLists1$WID, decreasing=TRUE),]

# Add English to the data
ENG <- data.frame("English","1")
names(ENG) <- c("Language","WID")
widLists1 <- rbind(widLists1, ENG)
 
# Get the mean of WID (whole set)
widLists1$WID <- as.numeric(as.character(widLists1$WID))
meanWID1 <- widLists1 %>% 
        group_by(Language) %>% 
        summarise(MeanWID= mean(WID))
 
# Get the standard deviation of WID (whole set)
sdWID1 <- widLists1 %>% 
        group_by(Language) %>% 
        summarise(SdWID = sd(WID))
 
meanWID1 <- merge(meanWID1, sdWID1, by="Language")
 
# Total number of subsets: 5 (230 verses per subset)
# Load the list of bible text file names and language codes, and count the number of words in each verse of the English text
listBible <- data[,c("Language","BibleFile")]
bibleENG <- listBible[listBible$Language=="English",]$BibleFile
bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
wordCount <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
colnames(wordCount)[1] <- "ENG"
wordCount$ID <- seq_along(wordCount$ENG)

# Function to calculate Word Information Density (WID), using English as a reference
 wordCountCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
  colnames(bible)[1] <- Language
  bible$ID <- seq_along(bible[,1])
  wordCount <- merge(wordCount,bible, by="ID")
  WID <- list()
  for(j in 1:5){
  k <- 230*(j-1) + 1
  l <- 230*j
  WID[[j]] <- sum(wordCount$ENG[k:l])/sum(wordCount[,ncol(wordCount)][k:l])
  }
  langWID <- c(Language, WID)
  return(langWID)
 }

# Calculate WID by using the list of bible text file names and language codes
listBible <- listBible[!listBible$Language=="English",]
widList <- list()
widLanList <- list()
widLists5 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- wordCountCal(bible, Language)  # call once per language: each call re-reads the file
  widList[[i]] <- res[2:6]
  widLanList[[i]] <- res[1]
  widLists5[[i]] <- cbind(unlist(widLanList[[i]]), unlist(widList[[i]]))
}
widLists5 <- as.data.frame(do.call(rbind, widLists5))
colnames(widLists5) <- c("Language","WID")
widLists5 <- widLists5[order(widLists5$WID, decreasing=TRUE),]

# Add English to the data
ENG <- data.frame("English","1")
ENG <- ENG[rep(seq_len(nrow(ENG)), each = 5),]
names(ENG) <- c("Language","WID")
widLists5 <- rbind(widLists5, ENG)

# Get the mean of WID (5 subsets)
widLists5$WID <- as.numeric(as.character(widLists5$WID))
meanWID5 <- widLists5 %>% 
        group_by(Language) %>% 
        summarise(MeanWID= mean(WID))
 
# Get the standard deviation of WID (5 subsets)
sdWID5 <- widLists5 %>% 
        group_by(Language) %>% 
        summarise(SdWID = sd(WID)) 
 
meanWID5 <- merge(meanWID5, sdWID5, by="Language")

# Total number of subsets: 10 (115 verses per subset) 
# Load the list of bible text file names and language codes, and count the number of words in each verse of the English text
listBible <- data[,c("Language","BibleFile")]
bibleENG <- listBible[listBible$Language=="English",]$BibleFile
bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
wordCount <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
colnames(wordCount)[1] <- "ENG"
wordCount$ID <- seq_along(wordCount$ENG)

# Function to calculate Word Information Density (WID), using English as a reference
 wordCountCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
  colnames(bible)[1] <- Language
  bible$ID <- seq_along(bible[,1])
  wordCount <- merge(wordCount,bible, by="ID")
  WID <- list()
  for(j in 1:10){
  k <- 115*(j-1) + 1
  l <- 115*j
  WID[[j]] <- sum(wordCount$ENG[k:l])/sum(wordCount[,ncol(wordCount)][k:l])
  }
  langWID <- c(Language, WID)
  return(langWID)
 }
 
# Calculate WID by using the list of bible text file names and language codes
listBible <- listBible[!listBible$Language=="English",]
widList <- list()
widLanList <- list()
widLists10 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- wordCountCal(bible, Language)  # call once per language: each call re-reads the file
  widList[[i]] <- res[2:11]
  widLanList[[i]] <- res[1]
  widLists10[[i]] <- cbind(unlist(widLanList[[i]]), unlist(widList[[i]]))
}
widLists10 <- as.data.frame(do.call(rbind, widLists10))
colnames(widLists10) <- c("Language","WID")
widLists10 <- widLists10[order(widLists10$WID, decreasing=TRUE),]

# Add English to the data
ENG <- data.frame("English","1")
ENG <- ENG[rep(seq_len(nrow(ENG)), each = 10),]
names(ENG) <- c("Language","WID")
widLists10 <- rbind(widLists10, ENG)

# Get the mean of WID (10 subsets)
widLists10$WID <- as.numeric(as.character(widLists10$WID))
meanWID10 <- widLists10 %>% 
        group_by(Language) %>% 
        summarise(MeanWID= mean(WID))
 
# Get the standard deviation of WID (10 subsets)
sdWID10 <- widLists10 %>% 
        group_by(Language) %>% 
        summarise(SdWID = sd(WID)) 
 
meanWID10 <- merge(meanWID10, sdWID10, by="Language")
 
# Total number of subsets: 20 (57 verses per subset; the last 10 of the 1,150 verses are left unused)
# Load the list of bible text file names and language codes, and count the number of words in each verse of the English text
listBible <- data[,c("Language","BibleFile")]
bibleENG <- listBible[listBible$Language=="English",]$BibleFile
bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
wordCount <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
colnames(wordCount)[1] <- "ENG"
wordCount$ID <- seq_along(wordCount$ENG)

# Function to calculate Word Information Density (WID), using English as a reference
 wordCountCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
  colnames(bible)[1] <- Language
  bible$ID <- seq_along(bible[,1])
  wordCount <- merge(wordCount,bible, by="ID")
  WID <- list()
  for(j in 1:20){
  k <- 57*(j-1) + 1
  l <- 57*j
  WID[[j]] <- sum(wordCount$ENG[k:l])/sum(wordCount[,ncol(wordCount)][k:l])
  }
  langWID <- c(Language, WID)
  return(langWID)
 }

# Calculate WID by using the list of bible text file names and language codes
listBible <- listBible[!listBible$Language=="English",]
widList <- list()
widLanList <- list()
widLists20 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- wordCountCal(bible, Language)  # call once per language: each call re-reads the file
  widList[[i]] <- res[2:21]
  widLanList[[i]] <- res[1]
  widLists20[[i]] <- cbind(unlist(widLanList[[i]]), unlist(widList[[i]]))
}
widLists20 <- as.data.frame(do.call(rbind, widLists20))
colnames(widLists20) <- c("Language","WID")
widLists20 <- widLists20[order(widLists20$WID, decreasing=TRUE),]

# Add English to the data
ENG <- data.frame("English","1")
ENG <- ENG[rep(seq_len(nrow(ENG)), each = 20),]
names(ENG) <- c("Language","WID")
widLists20 <- rbind(widLists20, ENG)

# Get the mean of WID (20 subsets)
widLists20$WID <- as.numeric(as.character(widLists20$WID))
meanWID20 <- widLists20 %>% 
        group_by(Language) %>% 
        summarise(MeanWID= mean(WID))
 
# Get the standard deviation of WID (20 subsets)
sdWID20 <- widLists20 %>% 
        group_by(Language) %>% 
        summarise(SdWID = sd(WID)) 
 
meanWID20 <- merge(meanWID20, sdWID20, by="Language")
 
# Total number of subsets: 40 (28 verses per subset; the last 30 verses are left unused)
# Load the list of bible text file names and language codes, and count the number of words in each verse of the English text
listBible <- data[,c("Language","BibleFile")]
bibleENG <- listBible[listBible$Language=="English",]$BibleFile
bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
wordCount <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
colnames(wordCount)[1] <- "ENG"
wordCount$ID <- seq_along(wordCount$ENG)

# Function to calculate Word Information Density (WID), using English as a reference
 wordCountCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
  colnames(bible)[1] <- Language
  bible$ID <- seq_along(bible[,1])
  wordCount <- merge(wordCount,bible, by="ID")
  WID <- list()
  for(j in 1:40){
  k <- 28*(j-1) + 1
  l <- 28*j
  WID[[j]] <- sum(wordCount$ENG[k:l])/sum(wordCount[,ncol(wordCount)][k:l])
  }
  langWID <- c(Language, WID)
  return(langWID)
 }
 
# Calculate WID by using the list of bible text file names and language codes
listBible <- listBible[!listBible$Language=="English",]
widList <- list()
widLanList <- list()
widLists40 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- wordCountCal(bible, Language)  # call once per language: each call re-reads the file
  widList[[i]] <- res[2:41]
  widLanList[[i]] <- res[1]
  widLists40[[i]] <- cbind(unlist(widLanList[[i]]), unlist(widList[[i]]))
}
widLists40 <- as.data.frame(do.call(rbind, widLists40))
colnames(widLists40) <- c("Language","WID")
widLists40 <- widLists40[order(widLists40$WID, decreasing=TRUE),]

# Add English to the data
ENG <- data.frame("English","1")
ENG <- ENG[rep(seq_len(nrow(ENG)), each = 40),]
names(ENG) <- c("Language","WID")
widLists40 <- rbind(widLists40, ENG)

# Get the mean of WID (40 subsets)
widLists40$WID <- as.numeric(as.character(widLists40$WID))
meanWID40 <- widLists40 %>% 
        group_by(Language) %>% 
        summarise(MeanWID= mean(WID))
 
# Get the standard deviation of WID (40 subsets)
sdWID40 <- widLists40 %>% 
        group_by(Language) %>% 
        summarise(SdWID = sd(WID)) 
 
meanWID40 <- merge(meanWID40, sdWID40, by="Language")

# Total number of subsets: 60 (19 verses per subset; the last 10 verses are left unused)
# Load the list of bible text file names and language codes, and count the number of words in each verse of the English text
listBible <- data[,c("Language","BibleFile")]
bibleENG <- listBible[listBible$Language=="English",]$BibleFile
bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
wordCount <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
colnames(wordCount)[1] <- "ENG"
wordCount$ID <- seq_along(wordCount$ENG)

# Function to calculate Word Information Density (WID), using English as a reference
 wordCountCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
  colnames(bible)[1] <- Language
  bible$ID <- seq_along(bible[,1])
  wordCount <- merge(wordCount,bible, by="ID")
  WID <- list()
  for(j in 1:60){
  k <- 19*(j-1) + 1
  l <- 19*j
  WID[[j]] <- sum(wordCount$ENG[k:l])/sum(wordCount[,ncol(wordCount)][k:l])
  }
  langWID <- c(Language, WID)
  return(langWID)
 }
 
# Calculate WID by using the list of bible text file names and language codes
listBible <- listBible[!listBible$Language=="English",]
widList <- list()
widLanList <- list()
widLists60 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- wordCountCal(bible, Language)  # call once per language: each call re-reads the file
  widList[[i]] <- res[2:61]
  widLanList[[i]] <- res[1]
  widLists60[[i]] <- cbind(unlist(widLanList[[i]]), unlist(widList[[i]]))
}
widLists60 <- as.data.frame(do.call(rbind, widLists60))
colnames(widLists60) <- c("Language","WID")
widLists60 <- widLists60[order(widLists60$WID, decreasing=TRUE),]

# Add English to the data
ENG <- data.frame("English","1")
ENG <- ENG[rep(seq_len(nrow(ENG)), each = 60),]
names(ENG) <- c("Language","WID")
widLists60 <- rbind(widLists60, ENG)

# Get the mean of WID (60 subsets)
widLists60$WID <- as.numeric(as.character(widLists60$WID))
meanWID60 <- widLists60 %>% 
        group_by(Language) %>% 
        summarise(MeanWID= mean(WID))
 
# Get the standard deviation of WID (60 subsets)
sdWID60 <- widLists60 %>% 
        group_by(Language) %>% 
        summarise(SdWID = sd(WID)) 
 
meanWID60 <- merge(meanWID60, sdWID60, by="Language")
 
# Save the results
meanWID1$NbS <- "1"
meanWID5$NbS <- "5"
meanWID10$NbS <- "10"
meanWID20$NbS <- "20"
meanWID40$NbS <- "40"
meanWID60$NbS <- "60" 
listMeanWID <- rbind(meanWID1, meanWID5, meanWID10, meanWID20, meanWID40, meanWID60)

widLists1$NbS <- "1"
widLists5$NbS <- "5"
widLists10$NbS <- "10"
widLists20$NbS <- "20"
widLists40$NbS <- "40"
widLists60$NbS <- "60"
listWID <- rbind(widLists1, widLists5, widLists10, widLists20, widLists40, widLists60)
listWID <- merge(listMeanWID, listWID, by=c("Language","NbS"))
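
The six configuration blocks above differ only in the number of subsets and the number of verses per subset, so they could be collapsed into one parameterized helper. A minimal sketch, assuming a merged per-verse count table like wordCount above (English counts in column ENG, the target language in the last column); `widBySubsets` is a hypothetical name, not part of the script:

```r
# Per-subset WID for any configuration: nbS subsets of sizeS verses each.
# Verses beyond nbS * sizeS are ignored, as in the blocks above.
widBySubsets <- function(counts, nbS, sizeS) {
  sapply(seq_len(nbS), function(j) {
    idx <- (sizeS * (j - 1) + 1):(sizeS * j)
    sum(counts$ENG[idx]) / sum(counts[[ncol(counts)]][idx])
  })
}
```

Given such a merged table, `widBySubsets(wordCount, nbS = 5, sizeS = 230)` would reproduce the 5-subset case, and `nbS = 1, sizeS = 1150` the whole-set case.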

4.2.1.1 WID by different corpus sampling configurations

The distribution of WID is displayed according to different corpus sampling configurations (whole set, 5, 10, 20, 40, and 60 subsets).

4.2.1.2 Rank of average WID by different corpus sampling configurations

The language ranks of the average WID are displayed according to different corpus sampling configurations (whole set, 5, 10, 20, 40, and 60 subsets).

The y-axis shows the language ranks according to the sampling configuration (whole set, 5, 10, 20, 40, and 60 subsets, on the x-axis). Languages are displayed in gray when rank is preserved throughout all configurations and in orange when changes occur, with orange edges underlying the changes.


4.2.2 Type-Token Ratio (TTR)

TTR is calculated as the ratio of vocabulary size (number of unique word types) to text length (total number of word tokens).
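
For instance, lowercasing a toy sentence and splitting on spaces, as the code below does for each subset:

```r
# 10 tokens, 7 types ("the" and "cat" repeat) -> TTR = 0.7
tokens <- strsplit(tolower("The cat sat on the mat and the cat slept"), " ")[[1]]
ttr <- length(unique(tokens)) / length(tokens)
```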

# Total number of subsets: 1 (1,150 verses per subset)
# Function to calculate TTR
ttrCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(bible)[1] <- "Language"
  text <- bible$Language[1:nrow(bible)]
  text <- tolower(paste(text,collapse=" "))
  type <- as.data.frame(table(strsplit(text," ")))
  ttr <- nrow(type)/sum(type$Freq)
  ttrV <- c(Language, ttr)
  return(ttrV)
 }

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate TTR by using the list of bible text file names and language codes
ttrList <- list()
ttrLanList <- list()
ttrLists1 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- ttrCal(bible, Language)  # call once per language: each call re-reads the file
  ttrList[[i]] <- res[2]
  ttrLanList[[i]] <- res[1]
  ttrLists1[[i]] <- cbind(unlist(ttrLanList[[i]]), unlist(ttrList[[i]]))
}
ttrLists1 <- as.data.frame(do.call(rbind, ttrLists1))
colnames(ttrLists1) <- c("Language","TTR")
 
# Get the mean of TTR (whole set)
ttrLists1$TTR <- as.numeric(as.character(ttrLists1$TTR))
meanTTR1 <- ttrLists1 %>% 
  group_by(Language) %>% 
  summarise(MeanTTR= mean(TTR))

# Get the standard deviation of TTR (whole set)
sdTTR1 <- ttrLists1 %>% 
    group_by(Language) %>% 
    summarise(SdTTR= sd(TTR)) 

meanTTR1 <- merge(meanTTR1, sdTTR1, by="Language")

# Total number of subsets: 5 (230 verses per subset)
# Function to calculate TTR
ttrCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(bible)[1] <- "Language"
  ttrV <- list()
  for(j in 1:5){
  k <- 230*(j-1) + 1
  l <- 230*j
  text <- bible$Language[k:l]
  text <- tolower(paste(text,collapse=" "))
  type <- as.data.frame(table(strsplit(text," ")))
  ttr <- nrow(type)/sum(type$Freq)
  ttrV[[j]] <- ttr
  }
  ttrVList <- c(Language, ttrV)
  return(ttrVList)
}

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate TTR by using the list of bible text file names and language codes
ttrList <- list()
ttrLanList <- list()
ttrLists5 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- ttrCal(bible, Language)  # call once per language: each call re-reads the file
  ttrList[[i]] <- res[2:6]
  ttrLanList[[i]] <- res[1]
  ttrLists5[[i]] <- cbind(unlist(ttrLanList[[i]]), unlist(ttrList[[i]]))
}
ttrLists5 <- as.data.frame(do.call(rbind, ttrLists5))
colnames(ttrLists5) <- c("Language","TTR")

# Get the mean of TTR (5 subsets)
ttrLists5$TTR <- as.numeric(as.character(ttrLists5$TTR))
meanTTR5 <- ttrLists5 %>% 
  group_by(Language) %>% 
  summarise(MeanTTR= mean(TTR))

# Get the standard deviation of TTR (5 subsets)
sdTTR5 <- ttrLists5 %>% 
  group_by(Language) %>% 
  summarise(SdTTR= sd(TTR)) 

meanTTR5 <- merge(meanTTR5, sdTTR5, by="Language")
 
# Total number of subsets: 10 (115 verses per subset)
# Function to calculate TTR
ttrCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(bible)[1] <- "Language"
  ttrV <- list()
  for(j in 1:10){
      k <- 115*(j-1) + 1
      l <- 115*j
  text <- bible$Language[k:l]
  text <- tolower(paste(text,collapse=" "))
  type <- as.data.frame(table(strsplit(text," ")))
  ttr <- nrow(type)/sum(type$Freq)
  ttrV[[j]] <- ttr
  }
  ttrVList <- c(Language, ttrV)
  return(ttrVList)
}

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate TTR by using the list of bible text file names and language codes
ttrList <- list()
ttrLanList <- list()
ttrLists10 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- ttrCal(bible, Language)  # call once per language: each call re-reads the file
  ttrList[[i]] <- res[2:11]
  ttrLanList[[i]] <- res[1]
  ttrLists10[[i]] <- cbind(unlist(ttrLanList[[i]]), unlist(ttrList[[i]]))
}
ttrLists10 <- as.data.frame(do.call(rbind, ttrLists10))
colnames(ttrLists10) <- c("Language","TTR")

# Get the mean of TTR (10 subsets)
ttrLists10$TTR <- as.numeric(as.character(ttrLists10$TTR))
meanTTR10 <- ttrLists10 %>% 
  group_by(Language) %>% 
  summarise(MeanTTR= mean(TTR))

# Get the standard deviation of TTR (10 subsets)
sdTTR10 <- ttrLists10 %>% 
  group_by(Language) %>% 
  summarise(SdTTR = sd(TTR)) 

meanTTR10 <- merge(meanTTR10, sdTTR10, by="Language")

# Total number of subsets: 20 (57 verses per subset; the last 10 of the 1,150 verses are left unused)
# Function to calculate TTR
ttrCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(bible)[1] <- "Language"
  ttrV <- list()
  for(j in 1:20){
    k <- 57*(j-1) + 1
    l <- 57*j
  text <- bible$Language[k:l]
  text <- tolower(paste(text,collapse=" "))
  type <- as.data.frame(table(strsplit(text," ")))
  ttr <- nrow(type)/sum(type$Freq)
  ttrV[[j]] <- ttr
  }
  ttrVList <- c(Language, ttrV)
  return(ttrVList)
}

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate TTR by using the list of bible text file names and language codes
ttrList <- list()
ttrLanList <- list()
ttrLists20 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- ttrCal(bible, Language)  # call once per language: each call re-reads the file
  ttrList[[i]] <- res[2:21]
  ttrLanList[[i]] <- res[1]
  ttrLists20[[i]] <- cbind(unlist(ttrLanList[[i]]), unlist(ttrList[[i]]))
}
ttrLists20 <- as.data.frame(do.call(rbind, ttrLists20))
colnames(ttrLists20) <- c("Language","TTR")

# Get the mean of TTR (20 subsets)
ttrLists20$TTR <- as.numeric(as.character(ttrLists20$TTR))
meanTTR20 <- ttrLists20 %>% 
  group_by(Language) %>% 
  summarise(MeanTTR= mean(TTR))

# Get the standard deviation of TTR (20 subsets)
sdTTR20 <- ttrLists20 %>% 
  group_by(Language) %>% 
  summarise(SdTTR = sd(TTR)) 

meanTTR20 <- merge(meanTTR20, sdTTR20, by="Language")

# Total number of subsets: 40 (28 verses per subset; the last 30 verses are left unused)
# Function to calculate TTR
ttrCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(bible)[1] <- "Language"
  ttrV <- list()
  for(j in 1:40){
    k <- 28*(j-1) + 1
    l <- 28*j
  text <- bible$Language[k:l]
  text <- tolower(paste(text,collapse=" "))
  type <- as.data.frame(table(strsplit(text," ")))
  ttr <- nrow(type)/sum(type$Freq)
  ttrV[[j]] <- ttr
  }
  ttrVList <- c(Language, ttrV)
  return(ttrVList)
}

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate TTR by using the list of bible text file names and language codes
ttrList <- list()
ttrLanList <- list()
ttrLists40 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- ttrCal(bible, Language)  # call once per language: each call re-reads the file
  ttrList[[i]] <- res[2:41]
  ttrLanList[[i]] <- res[1]
  ttrLists40[[i]] <- cbind(unlist(ttrLanList[[i]]), unlist(ttrList[[i]]))
}
ttrLists40 <- as.data.frame(do.call(rbind, ttrLists40))
colnames(ttrLists40) <- c("Language","TTR")

# Get the mean of TTR (40 subsets)
ttrLists40$TTR <- as.numeric(as.character(ttrLists40$TTR))
meanTTR40 <- ttrLists40 %>% 
  group_by(Language) %>% 
  summarise(MeanTTR= mean(TTR))

# Get the standard deviation of TTR (40 subsets)
sdTTR40 <- ttrLists40 %>% 
  group_by(Language) %>% 
  summarise(SdTTR = sd(TTR)) 

meanTTR40 <- merge(meanTTR40, sdTTR40, by="Language")

# Total number of subsets: 60 (19 verses per subset; the last 10 verses are left unused)
# Function to calculate TTR
ttrCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(bible)[1] <- "Language"
  ttrV <- list()
  for(j in 1:60){
    k <- 19*(j-1) + 1
    l <- 19*j
  text <- bible$Language[k:l]
  text <- tolower(paste(text,collapse=" "))
  type <- as.data.frame(table(strsplit(text," ")))
  ttr <- nrow(type)/sum(type$Freq)
  ttrV[[j]] <- ttr
  }
  ttrVList <- c(Language, ttrV)
  return(ttrVList)
 }

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate TTR by using the list of bible text file names and language codes
ttrList <- list()
ttrLanList <- list()
ttrLists60 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- ttrCal(bible, Language)  # call once per language: each call re-reads the file
  ttrList[[i]] <- res[2:61]
  ttrLanList[[i]] <- res[1]
  ttrLists60[[i]] <- cbind(unlist(ttrLanList[[i]]), unlist(ttrList[[i]]))
}
ttrLists60 <- as.data.frame(do.call(rbind, ttrLists60))
colnames(ttrLists60) <- c("Language","TTR")

# Get the mean of TTR (60 subsets)
ttrLists60$TTR <- as.numeric(as.character(ttrLists60$TTR))
meanTTR60 <- ttrLists60 %>% 
  group_by(Language) %>% 
  summarise(MeanTTR= mean(TTR))

# Get the standard deviation of TTR (60 subsets)
sdTTR60 <- ttrLists60 %>% 
  group_by(Language) %>% 
  summarise(SdTTR = sd(TTR)) 

meanTTR60 <- merge(meanTTR60, sdTTR60, by="Language")
 
# Save the results
meanTTR1$NbS <- "1"
meanTTR5$NbS <- "5"
meanTTR10$NbS <- "10"
meanTTR20$NbS <- "20"
meanTTR40$NbS <- "40"
meanTTR60$NbS <- "60" 
listMeanTTR <- rbind(meanTTR1, meanTTR5, meanTTR10, meanTTR20, meanTTR40, meanTTR60)

ttrLists1$NbS <- "1"
ttrLists5$NbS <- "5"
ttrLists10$NbS <- "10"
ttrLists20$NbS <- "20"
ttrLists40$NbS <- "40"
ttrLists60$NbS <- "60"
listTTR <- rbind(ttrLists1, ttrLists5, ttrLists10, ttrLists20, ttrLists40, ttrLists60)

listTTR <- merge(listMeanTTR, listTTR, by=c("Language","NbS"))

4.2.2.1 TTR by different corpus sampling configurations

The distribution of TTR is displayed according to different corpus sampling configurations (whole set, 5, 10, 20, 40, and 60 subsets).

4.2.2.2 Rank of average TTR by different corpus sampling configurations

The language ranks of the average TTR are displayed according to different corpus sampling configurations (whole set, 5, 10, 20, 40, and 60 subsets).

4.2.2.3 Average TTR by different corpus sampling configurations in each language

The average TTR in each language is displayed according to different corpus sampling configurations (whole set, 5, 10, 20, 40, and 60 subsets).

4.2.3 Measure of Textual Lexical Diversity (MTLD)

Another measure of lexical diversity, the Measure of Textual Lexical Diversity (MTLD), is defined as the mean length of sequential word strings that maintain a TTR above 0.72 (the point-of-stabilization threshold established by McCarthy & Jarvis 2010). MTLD is calculated with the Python implementation released by John Frens (downloaded from https://github.com/jennafrens/lexical_diversity/ in January 2022).
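
For intuition, here is a minimal one-directional sketch of the factor counting behind MTLD in base R. It is illustrative only, not the package's implementation (which also runs a backward pass over the reversed text and averages the two estimates), and `mtld_oneway` is a hypothetical name:

```r
# Count "factors": whenever the running TTR of the current stretch drops below
# the threshold, close a factor and start a new stretch; a leftover stretch
# contributes a partial factor. MTLD = tokens / factors.
# (A text whose TTR never drops below the threshold yields factors close to 0,
# so MTLD is unstable for very short or fully diverse token sequences.)
mtld_oneway <- function(tokens, threshold = 0.72) {
  factors <- 0; types <- character(0); n <- 0
  for (w in tokens) {
    n <- n + 1
    types <- union(types, w)
    if (length(types) / n < threshold) {  # stretch exhausted: close a factor
      factors <- factors + 1
      types <- character(0); n <- 0
    }
  }
  if (n > 0)  # partial factor for the remaining stretch
    factors <- factors + (1 - length(types) / n) / (1 - threshold)
  length(tokens) / factors
}
```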

# Assumes the reticulate (py_run_string, py_capture_output) and stringr packages are loaded earlier in this script
setwd("./lexical_diversity-master")
py_run_string("from lexical_diversity import mtld")

# Total number of subsets: 1 (1,150 verses per subset)
# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Function to calculate MTLD
ldCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(bible)[1] <- "Language"
    text <- bible$Language[1:nrow(bible)]
    text <- tolower(paste(text,collapse=" "))
    write.table(text, "text.txt", quote=FALSE, col.names=FALSE, row.names=FALSE, sep = '\t')
    py_run_string('text=open("text.txt","r")')
    py_run_string('text=text.read()')
    ldValue <- str_replace_all(py_capture_output(py_run_string('print(mtld(text.split()))')),"[\r\n]","")
  langLd <- c(Language, ldValue)
  return(langLd)
}

# Calculate MTLD by using the list of bible text file names and language codes
mtldList <- list()
mtldLanList <- list()
mtldLists1 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- ldCal(bible, Language)  # call once per language: each call re-reads the file
  mtldList[[i]] <- res[2]
  mtldLanList[[i]] <- res[1]
  mtldLists1[[i]] <- cbind(unlist(mtldLanList[[i]]), unlist(mtldList[[i]]))
}
mtldLists1 <- as.data.frame(do.call(rbind, mtldLists1))
colnames(mtldLists1) <- c("Language","MTLD")
 
# Get the mean of MTLD (whole set)
mtldLists1$MTLD <- as.numeric(as.character(mtldLists1$MTLD))
meanMTLD1 <- mtldLists1 %>% 
  group_by(Language) %>% 
  summarise(MeanMTLD= mean(MTLD))

# Get the standard deviation of MTLD (whole set)
sdMTLD1 <- mtldLists1 %>% 
  group_by(Language) %>% 
  summarise(SdMTLD= sd(MTLD)) 

meanMTLD1 <- merge(meanMTLD1, sdMTLD1, by="Language")

# Total number of subsets: 5 (230 verses per subset)
# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Function to calculate MTLD
ldCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(bible)[1] <- "Language"
  ldList <- list()
  ldValue <- list()
  for(j in 1:5){
    k <- 230*(j-1) + 1
    l <- 230*j
    ldList[[j]] <- bible$Language[k:l]
    text <- ldList[[j]]
    text <- tolower(paste(text,collapse=" "))
    write.table(text, "text.txt", quote=FALSE, col.names=FALSE, row.names=FALSE, sep = '\t')
    py_run_string('text=open("text.txt","r")')
    py_run_string('text=text.read()')
    ldValue[[j]] <- str_replace_all(py_capture_output(py_run_string('print(mtld(text.split()))')),"[\r\n]","")
  }
  langLd <- c(Language, ldValue)
  return(langLd)
}

# Calculate MTLD by using the list of bible text file names and language codes
mtldList <- list()
mtldLanList <- list()
mtldLists5 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- ldCal(bible, Language)  # call once per language: each call re-reads the file
  mtldList[[i]] <- res[2:6]
  mtldLanList[[i]] <- res[1]
  mtldLists5[[i]] <- cbind(unlist(mtldLanList[[i]]), unlist(mtldList[[i]]))
}
mtldLists5 <- as.data.frame(do.call(rbind, mtldLists5))
colnames(mtldLists5) <- c("Language","MTLD")

# Get the mean of MTLD (5 subsets)
mtldLists5$MTLD <- as.numeric(as.character(mtldLists5$MTLD))
meanMTLD5 <- mtldLists5 %>% 
  group_by(Language) %>% 
  summarise(MeanMTLD= mean(MTLD))

# Get the standard deviation of MTLD (5 subsets)
sdMTLD5 <- mtldLists5 %>% 
  group_by(Language) %>% 
  summarise(SdMTLD = sd(MTLD)) 

meanMTLD5 <- merge(meanMTLD5, sdMTLD5, by="Language")

# Total number of subsets: 10 (115 verses per subset)
# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Function to calculate MTLD
ldCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(bible)[1] <- "Language"
  ldList <- list()
  ldValue <- list()
  for(j in 1:10){
      k <- 115*(j-1) + 1
      l <- 115*j
    ldList[[j]] <- bible$Language[k:l]
    text <- ldList[[j]]
    text <- tolower(paste(text,collapse=" "))
    write.table(text, "text.txt", quote=FALSE, col.names=FALSE, row.names=FALSE, sep = '\t')
    py_run_string('text=open("text.txt","r")')
    py_run_string('text=text.read()')
    ldValue[[j]] <- str_replace_all(py_capture_output(py_run_string('print(mtld(text.split()))')),"[\r\n]","")
  }
  langLd <- c(Language, ldValue)
  return(langLd)
 }
 
# Calculate MTLD by using the list of bible text file names and language codes
mtldList <- list()
mtldLanList <- list()
mtldLists10 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- ldCal(bible, Language)  # call once per language: each call re-reads the file
  mtldList[[i]] <- res[2:11]
  mtldLanList[[i]] <- res[1]
  mtldLists10[[i]] <- cbind(unlist(mtldLanList[[i]]), unlist(mtldList[[i]]))
}
mtldLists10 <- as.data.frame(do.call(rbind, mtldLists10))
colnames(mtldLists10) <- c("Language","MTLD")

# Get the mean of MTLD (10 subsets)
mtldLists10$MTLD <- as.numeric(as.character(mtldLists10$MTLD))
meanMTLD10 <- mtldLists10 %>% 
  group_by(Language) %>% 
  summarise(MeanMTLD= mean(MTLD))
 
# Get the standard deviation of MTLD (10 subsets)
sdMTLD10 <- mtldLists10 %>% 
  group_by(Language) %>% 
  summarise(SdMTLD = sd(MTLD)) 

meanMTLD10 <- merge(meanMTLD10, sdMTLD10, by="Language")

# Total number of subsets: 20 (57 verses per subset; the last 10 of the 1,150 verses are left unused)
# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Function to calculate MTLD
ldCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(bible)[1] <- "Language"
  ldList <- list()
  ldValue <- list()
  for(j in 1:20){
    k <- 57*(j-1) + 1
    l <- 57*j
    ldList[[j]] <- bible$Language[k:l]
    text <- ldList[[j]]
    text <- tolower(paste(text,collapse=" "))
    write.table(text, "text.txt", quote=FALSE, col.names=FALSE, row.names=FALSE, sep = '\t')
    py_run_string('text=open("text.txt","r")')
    py_run_string('text=text.read()')
    ldValue[[j]] <- str_replace_all(py_capture_output(py_run_string('print(mtld(text.split()))')),"[\r\n]","")
  }
  langLd <- c(Language, ldValue)
  return(langLd)
}

# Calculate MTLD by using the list of bible text file names and language codes
mtldList <- list()
mtldLanList <- list()
mtldLists20 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  res <- ldCal(bible, Language)  # call once per language: each call re-reads the file
  mtldList[[i]] <- res[2:21]
  mtldLanList[[i]] <- res[1]
  mtldLists20[[i]] <- cbind(unlist(mtldLanList[[i]]), unlist(mtldList[[i]]))
}
mtldLists20 <- as.data.frame(do.call(rbind, mtldLists20))
colnames(mtldLists20) <- c("Language","MTLD")

# Get the mean of MTLD (20 subsets)
mtldLists20$MTLD <- as.numeric(as.character(mtldLists20$MTLD))
meanMTLD20 <- mtldLists20 %>% 
  group_by(Language) %>% 
  summarise(MeanMTLD= mean(MTLD))

# Get the standard deviation of MTLD (20 subsets)
sdMTLD20 <- mtldLists20 %>% 
  group_by(Language) %>% 
  summarise(SdMTLD = sd(MTLD)) 
 
meanMTLD20 <- merge(meanMTLD20, sdMTLD20, by="Language")

# Total number of subsets: 40 (28 verses per subset)
# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Function to calculate MTLD
ldCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(bible)[1] <- "Language"
  ldList <- list()
  ldValue <- list()
  for(j in 1:40){
    k <- 28*(j-1) + 1
    l <- 28*j
    ldList[[j]] <- bible$Language[k:l]
    text <- ldList[[j]]
    text <- tolower(paste(text,collapse=" "))
    write.table(text, "text.txt", quote=FALSE, col.names=FALSE, row.names=FALSE, sep = '\t')
    py_run_string('text=open("text.txt","r")')
    py_run_string('text=text.read()')
    ldValue[[j]] <- str_replace_all(py_capture_output(py_run_string('print(mtld(text.split()))')),"[\r\n]","")
  }
  langLd <- c(Language, ldValue)
  return(langLd)
}

# Calculate MTLD by using the list of bible text file names and language codes
mtldList <- list()
mtldLanList <- list()
mtldLists40 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./",listBible$BibleFile[i])
  Language <- listBible$Language[i] 
  mtldRes <- ldCal(bible,Language)  # call ldCal once; calling it twice would read and score the text twice
  mtldList[[i]] <- mtldRes[2:41]
  mtldLanList[[i]] <- mtldRes[1]
  mtldLists40[[i]] <- cbind(unlist(mtldLanList[[i]]),unlist(mtldList[[i]]))
}
mtldLists40 <- as.data.frame(do.call(rbind, mtldLists40))
colnames(mtldLists40) <- c("Language","MTLD")

# Get the mean of MTLD (40 subsets)
mtldLists40$MTLD <- as.numeric(as.character(mtldLists40$MTLD))
meanMTLD40 <- mtldLists40 %>% 
  group_by(Language) %>% 
  summarise(MeanMTLD= mean(MTLD))

# Get the standard deviation of MTLD (40 subsets)
sdMTLD40 <- mtldLists40 %>% 
  group_by(Language) %>% 
  summarise(SdMTLD = sd(MTLD)) 
 
meanMTLD40 <- merge(meanMTLD40, sdMTLD40, by="Language")

# Total number of subsets: 60 (19 verses per subset)
# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Function to calculate MTLD
ldCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(bible)[1] <- "Language"
  ldList <- list()
  ldValue <- list()
  for(j in 1:60){
    k <- 19*(j-1) + 1
    l <- 19*j
    ldList[[j]] <- bible$Language[k:l]
    text <- ldList[[j]]
    text <- tolower(paste(text,collapse=" "))
    write.table(text, "text.txt", quote=FALSE, col.names=FALSE, row.names=FALSE, sep = '\t')
    py_run_string('text=open("text.txt","r")')
    py_run_string('text=text.read()')
    ldValue[[j]] <- str_replace_all(py_capture_output(py_run_string('print(mtld(text.split()))')),"[\r\n]","")
  }
  langLd <- c(Language, ldValue)
  return(langLd)
}

# Calculate MTLD by using the list of bible text file names and language codes
mtldList <- list()
mtldLanList <- list()
mtldLists60 <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./",listBible$BibleFile[i])
  Language <- listBible$Language[i] 
  mtldRes <- ldCal(bible,Language)  # call ldCal once; calling it twice would read and score the text twice
  mtldList[[i]] <- mtldRes[2:61]
  mtldLanList[[i]] <- mtldRes[1]
  mtldLists60[[i]] <- cbind(unlist(mtldLanList[[i]]),unlist(mtldList[[i]]))
}
mtldLists60 <- as.data.frame(do.call(rbind, mtldLists60))
colnames(mtldLists60) <- c("Language","MTLD")

# Get the mean of MTLD (60 subsets)
mtldLists60$MTLD <- as.numeric(as.character(mtldLists60$MTLD))
meanMTLD60 <- mtldLists60 %>% 
  group_by(Language) %>% 
  summarise(MeanMTLD= mean(MTLD))

# Get the standard deviation of MTLD (60 subsets)
sdMTLD60 <- mtldLists60 %>% 
  group_by(Language) %>% 
  summarise(SdMTLD = sd(MTLD)) 

meanMTLD60 <- merge(meanMTLD60, sdMTLD60, by="Language")

# Save the results
meanMTLD1$NbS <- "1"
meanMTLD5$NbS <- "5"
meanMTLD10$NbS <- "10"
meanMTLD20$NbS <- "20"
meanMTLD40$NbS <- "40"
meanMTLD60$NbS <- "60" 
listMeanMTLD <- rbind(meanMTLD1, meanMTLD5, meanMTLD10, meanMTLD20, meanMTLD40, meanMTLD60)

mtldLists1$NbS <- "1"
mtldLists5$NbS <- "5"
mtldLists10$NbS <- "10"
mtldLists20$NbS <- "20"
mtldLists40$NbS <- "40"
mtldLists60$NbS <- "60"
listMTLD <- rbind(mtldLists1, mtldLists5, mtldLists10, mtldLists20, mtldLists40, mtldLists60)

listMTLD <- merge(listMeanMTLD, listMTLD, by=c("Language","NbS"))

4.2.3.1 MTLD by different corpus sampling configurations

The distribution of MTLD is displayed according to different corpus sampling configurations (Whole, 5, 10, 20, 40, and 60 subsets).

4.2.3.2 Rank of average MTLD by different corpus sampling configurations

The language ranks of the average MTLD are displayed according to different corpus sampling configurations (Whole, 5, 10, 20, 40, and 60 subsets).

4.2.3.3 Average MTLD by different corpus sampling configurations in each language

The average MTLD in each language is displayed according to different corpus sampling configurations (Whole, 5, 10, 20, 40, and 60 subsets).

4.2.4 Word-level Entropy (H)

Word-level (unigram) Entropy (H) is defined as the average amount of information (or unpredictability) carried by a word. It is computed from the distribution of word probabilities estimated from the corpus.
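As a minimal sketch of this definition (in Python, which the script already uses via reticulate for MTLD; the lowercasing and whitespace tokenization mirror the R code):

```python
from collections import Counter
import math

def unigram_entropy(text):
    """H = -sum over word types of p(w) * log2(p(w)),
    with p(w) estimated by relative frequency in the text."""
    words = text.lower().split()
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Four equiprobable word types carry exactly 2 bits per word:
print(unigram_entropy("a b c d"))  # → 2.0
```

A text in which every word token is identical has H = 0; H grows both with vocabulary size and with how evenly the word types are used.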

# Total number of subsets: 1 (1,150 verses per subset)
# Function to calculate H
hCal <- function(bible, Language){
  Bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(Bible)[1] <- "Language"
  text <- Bible$Language[1:nrow(Bible)]
  text <- tolower(paste(text,collapse=" "))
  type <- as.data.frame(table(strsplit(text," ")))
  type$prob <- type$Freq/sum(type$Freq)
  type$prob1 <- log2(type$prob)*-1
  h <- sum(type$prob*type$prob1)
  hV <- c(Language, h)
  return(hV)
}

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate H by using the list of bible text file names and language codes
hList <- list()
hLanList <- list()
hLists1 <- list()
for(i in 1:nrow(listBible)){
  Bible <- paste0("./",listBible$BibleFile[i])
  Language <- listBible$Language[i] 
  hRes <- hCal(Bible,Language)  # call hCal once; calling it twice would read and score the text twice
  hList[[i]] <- hRes[2]
  hLanList[[i]] <- hRes[1]
  hLists1[[i]] <- cbind(unlist(hLanList[[i]]),unlist(hList[[i]]))
}
hLists1 <- as.data.frame(do.call(rbind, hLists1))
colnames(hLists1) <- c("Language","H")

# Get the mean of H (whole set)
hLists1$H <- as.numeric(as.character(hLists1$H))
meanH1 <- hLists1 %>% 
  group_by(Language) %>% 
  summarise(MeanH= mean(H))

# Get the standard deviation of H (whole set)
sdH1 <- hLists1 %>% 
  group_by(Language) %>% 
  summarise(SdH= sd(H)) 

meanH1 <- merge(meanH1, sdH1, by="Language")

# Total number of subsets: 5 (230 verses per subset)
# Function to calculate H
hCal <- function(Bible, Language){
  Bible <- read_delim(Bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(Bible)[1] <- "Language"
  hV <- list()
  for(j in 1:5){
    k <- 230*(j-1) + 1
    l <- 230*j
    text <- Bible$Language[k:l]
    text <- tolower(paste(text,collapse=" "))
    type <- as.data.frame(table(strsplit(text," ")))
    type$prob <- type$Freq/sum(type$Freq)
    type$prob1 <- log2(type$prob)*-1
    h <- sum(type$prob*type$prob1)
    hV[[j]] <- h
  }
  hVList <- c(Language, hV)
  return(hVList)
}

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate H by using the list of bible text file names and language codes
hList <- list()
hLanList <- list()
hLists5 <- list()
for(i in 1:nrow(listBible)){
  Bible <- paste0("./",listBible$BibleFile[i])
  Language <- listBible$Language[i] 
  hRes <- hCal(Bible,Language)  # call hCal once; calling it twice would read and score the text twice
  hList[[i]] <- hRes[2:6]
  hLanList[[i]] <- hRes[1]
  hLists5[[i]] <- cbind(unlist(hLanList[[i]]),unlist(hList[[i]]))
}
hLists5 <- as.data.frame(do.call(rbind, hLists5))
colnames(hLists5) <- c("Language","H")

# Get the mean of H (5 subsets)
hLists5$H <- as.numeric(as.character(hLists5$H))
meanH5 <- hLists5 %>% 
  group_by(Language) %>% 
  summarise(MeanH = mean(H))

# Get the standard deviation of H (5 subsets)
sdH5 <- hLists5 %>% 
  group_by(Language) %>% 
  summarise(SdH= sd(H)) 

meanH5 <- merge(meanH5, sdH5, by="Language")

# Total number of subsets: 10 (115 verses per subset) 
# Function to calculate H
hCal <- function(Bible, Language){
  Bible <- read_delim(Bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(Bible)[1] <- "Language"
  hV <- list()
  for(j in 1:10){
    k <- 115*(j-1) + 1
    l <- 115*j
    text <- Bible$Language[k:l]
    text <- tolower(paste(text,collapse=" "))
    type <- as.data.frame(table(strsplit(text," ")))
    type$prob <- type$Freq/sum(type$Freq)
    type$prob1 <- log2(type$prob)*-1
    h <- sum(type$prob*type$prob1)
    hV[[j]] <- h
  }
  hVList <- c(Language, hV)
  return(hVList)
}

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate H by using the list of bible text file names and language codes
hList <- list()
hLanList <- list()
hLists10 <- list()
for(i in 1:nrow(listBible)){
  Bible <- paste0("./",listBible$BibleFile[i])
  Language <- listBible$Language[i] 
  hRes <- hCal(Bible,Language)  # call hCal once; calling it twice would read and score the text twice
  hList[[i]] <- hRes[2:11]
  hLanList[[i]] <- hRes[1]
  hLists10[[i]] <- cbind(unlist(hLanList[[i]]),unlist(hList[[i]]))
}
hLists10 <- as.data.frame(do.call(rbind, hLists10))
colnames(hLists10) <- c("Language","H")

# Get the mean of H (10 subsets)
hLists10$H <- as.numeric(as.character(hLists10$H))
meanH10 <- hLists10 %>% 
  group_by(Language) %>% 
  summarise(MeanH = mean(H))

# Get the standard deviation of H (10 subsets)
sdH10 <- hLists10 %>% 
  group_by(Language) %>% 
  summarise(SdH= sd(H)) 

meanH10 <- merge(meanH10, sdH10, by="Language")

# Total number of subsets: 20 (57 verses per subset) 
# Function to calculate H
hCal <- function(Bible, Language){
  Bible <- read_delim(Bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(Bible)[1] <- "Language"
  hV <- list()
  for(j in 1:20){
    k <- 57*(j-1) + 1
    l <- 57*j
    text <- Bible$Language[k:l]
    text <- tolower(paste(text,collapse=" "))
    type <- as.data.frame(table(strsplit(text," ")))
    type$prob <- type$Freq/sum(type$Freq)
    type$prob1 <- log2(type$prob)*-1
    h <- sum(type$prob*type$prob1)
    hV[[j]] <- h
  }
  hVList <- c(Language, hV)
  return(hVList)
}

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate H by using the list of bible text file names and language codes
hList <- list()
hLanList <- list()
hLists20 <- list()
for(i in 1:nrow(listBible)){
  Bible <- paste0("./",listBible$BibleFile[i])
  Language <- listBible$Language[i] 
  hRes <- hCal(Bible,Language)  # call hCal once; calling it twice would read and score the text twice
  hList[[i]] <- hRes[2:21]
  hLanList[[i]] <- hRes[1]
  hLists20[[i]] <- cbind(unlist(hLanList[[i]]),unlist(hList[[i]]))
}
hLists20 <- as.data.frame(do.call(rbind, hLists20))
colnames(hLists20) <- c("Language","H")

# Get the mean of H (20 subsets)
hLists20$H <- as.numeric(as.character(hLists20$H))
meanH20 <- hLists20 %>% 
  group_by(Language) %>% 
  summarise(MeanH = mean(H))

# Get the standard deviation of H (20 subsets)
sdH20 <- hLists20 %>% 
  group_by(Language) %>% 
  summarise(SdH = sd(H)) 

meanH20 <- merge(meanH20, sdH20, by="Language")

# Total number of subsets: 40 (28 verses per subset)
# Function to calculate H
hCal <- function(Bible, Language){
  Bible <- read_delim(Bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(Bible)[1] <- "Language"
  hV <- list()
  for(j in 1:40){
    k <- 28*(j-1) + 1
    l <- 28*j
    text <- Bible$Language[k:l]
    text <- tolower(paste(text,collapse=" "))
    type <- as.data.frame(table(strsplit(text," ")))
    type$prob <- type$Freq/sum(type$Freq)
    type$prob1 <- log2(type$prob)*-1
    h <- sum(type$prob*type$prob1)
    hV[[j]] <- h
  }
  hVList <- c(Language, hV)
  return(hVList)
}

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate H by using the list of bible text file names and language codes
hList <- list()
hLanList <- list()
hLists40 <- list()
for(i in 1:nrow(listBible)){
  Bible <- paste0("./",listBible$BibleFile[i])
  Language <- listBible$Language[i] 
  hRes <- hCal(Bible,Language)  # call hCal once; calling it twice would read and score the text twice
  hList[[i]] <- hRes[2:41]
  hLanList[[i]] <- hRes[1]
  hLists40[[i]] <- cbind(unlist(hLanList[[i]]),unlist(hList[[i]]))
}
hLists40 <- as.data.frame(do.call(rbind, hLists40))
colnames(hLists40) <- c("Language","H")

# Get the mean of H (40 subsets)
hLists40$H <- as.numeric(as.character(hLists40$H))
meanH40 <- hLists40 %>% 
  group_by(Language) %>% 
  summarise(MeanH = mean(H))

# Get the standard deviation of H (40 subsets)
sdH40 <- hLists40 %>% 
  group_by(Language) %>% 
  summarise(SdH = sd(H)) 

meanH40 <- merge(meanH40, sdH40, by="Language")

# Total number of subsets: 60 (19 verses per subset)
# Function to calculate H
hCal <- function(Bible, Language){
  Bible <- read_delim(Bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  colnames(Bible)[1] <- "Language"
  hV <- list()
  for(j in 1:60){
    k <- 19*(j-1) + 1
    l <- 19*j
    text <- Bible$Language[k:l]
    text <- tolower(paste(text,collapse=" "))
    type <- as.data.frame(table(strsplit(text," ")))
    type$prob <- type$Freq/sum(type$Freq)
    type$prob1 <- log2(type$prob)*-1
    h <- sum(type$prob*type$prob1)
    hV[[j]] <- h
  }
  hVList <- c(Language, hV)
  return(hVList)
}

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate H by using the list of bible text file names and language codes
hList <- list()
hLanList <- list()
hLists60 <- list()
for(i in 1:nrow(listBible)){
  Bible <- paste0("./",listBible$BibleFile[i])
  Language <- listBible$Language[i] 
  hRes <- hCal(Bible,Language)  # call hCal once; calling it twice would read and score the text twice
  hList[[i]] <- hRes[2:61]
  hLanList[[i]] <- hRes[1]
  hLists60[[i]] <- cbind(unlist(hLanList[[i]]),unlist(hList[[i]]))
}
hLists60 <- as.data.frame(do.call(rbind, hLists60))
colnames(hLists60) <- c("Language","H")

# Get the mean of H (60 subsets)
hLists60$H <- as.numeric(as.character(hLists60$H))
meanH60 <- hLists60 %>% 
  group_by(Language) %>% 
  summarise(MeanH = mean(H))

# Get the standard deviation of H (60 subsets)
sdH60 <- hLists60 %>% 
  group_by(Language) %>% 
  summarise(SdH= sd(H)) 

meanH60 <- merge(meanH60, sdH60, by="Language")

# Save the results
meanH1$NbS <- "1"
meanH5$NbS <- "5"
meanH10$NbS <- "10"
meanH20$NbS <- "20"
meanH40$NbS <- "40"
meanH60$NbS <- "60" 
listMeanH <- rbind(meanH1, meanH5, meanH10, meanH20, meanH40, meanH60)

hLists1$NbS <- "1"
hLists5$NbS <- "5"
hLists10$NbS <- "10"
hLists20$NbS <- "20"
hLists40$NbS <- "40"
hLists60$NbS <- "60"
listH <- rbind(hLists1, hLists5, hLists10, hLists20, hLists40, hLists60)

listH <- merge(listMeanH, listH, by=c("Language","NbS"))

4.2.4.1 H by different corpus sampling configurations

The distribution of H is displayed according to different corpus sampling configurations (Whole, 5, 10, 20, 40, and 60 subsets).

4.2.4.2 Rank of average H by different corpus sampling configurations

The language ranks of the average H are displayed according to different corpus sampling configurations (Whole, 5, 10, 20, 40, and 60 subsets).

4.2.4.3 Average H by different corpus sampling configurations in each language

The average H in each language is displayed according to different corpus sampling configurations (Whole, 5, 10, 20, 40, and 60 subsets).

4.2.5 Impact of the sampling configuration on the four complexity indices

Impact of the sampling configuration on the four complexity indices (TTR: Type-Token Ratio; *H*: word-level Entropy; MTLD: Measure of Textual Lexical Diversity; WID: Word Information Density). In each panel, the y-axis shows the language ranks according to the sampling configuration (whole set, 5, 10, 20, 40, and 60 subsets, on the x-axis). Languages are displayed in gray when their rank is preserved across all configurations and in orange when it changes, with orange edges highlighting the changes.

4.3 Comparing corpus-based and grammar-based indices

4.3.1 Correlations observed in the whole set sampling configuration

Correlations observed in the Whole set sampling configuration. Panels on the diagonal show the distribution of each index across the languages. Top panels report Kendall's tau between paired indices, with font size proportional to magnitude and stars indicating statistical significance. Bottom panels display bivariate scatter plots with a fitted line. Indices are defined in Section 4 "Morphological Complexity", except IWI and LEX, which are introduced in Section 5 "Beyond Word Complexity" and included in this graph to facilitate the general discussion.

4.3.2 Bayesian correlation analyses

All correlations below are estimated in a Bayesian framework with the BayesFactor and bayestestR packages in R, using the default priors of the two functions involved, correlationBF and describe_posterior (see the code below). Each correlation is reported as the median of the Bayesian posterior distribution, along with the 95% credible interval of the correlation coefficient under a two-sided alternative hypothesis. The Bayes factor (BF) in support of the alternative hypothesis (viz. the existence of a correlation) is also reported: BF > 10 indicates strong support for the existence of a correlation, 3 < BF < 10 indicates moderate support, and BF values between 1 and 3 are considered weak.

BF1 <- describe_posterior(correlationBF(data$GMC_W, data$GMC_A))
kable_styling(kable(BF1[1:12],  align="c", format = "html", caption = "<left>GMC_W and GMC_A</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF2 <- describe_posterior(correlationBF(data$GMC_W, data$WID))
kable_styling(kable(BF2[1:12],  align="c", format = "html", caption = "<left>GMC_W and WID</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF3 <- describe_posterior(correlationBF(data$GMC_W, data$MTLD))
kable_styling(kable(BF3[1:12],  align="c", format = "html", caption = "<left>GMC_W and MTLD</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF4 <- describe_posterior(correlationBF(data$GMC_W, data$TTR))
kable_styling(kable(BF4[1:12],  align="c", format = "html", caption = "<left>GMC_W and TTR</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF5 <- describe_posterior(correlationBF(data$GMC_W, data$H))
kable_styling(kable(BF5[1:12],  align="c", format = "html", caption = "<left>GMC_W and H</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF6 <- describe_posterior(correlationBF(data$GMC_W, data$IWI))
kable_styling(kable(BF6[1:12],  align="c", format = "html", caption = "<left>GMC_W and IWI</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF7 <- describe_posterior(correlationBF(data$GMC_W, data$LEX))
kable_styling(kable(BF7[1:12],  align="c", format = "html", caption = "<left>GMC_W and LEX</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF8 <- describe_posterior(correlationBF(data$GMC_A, data$WID))
kable_styling(kable(BF8[1:12],  align="c", format = "html", caption = "<left>GMC_A and WID</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF9 <- describe_posterior(correlationBF(data$GMC_A, data$MTLD))
kable_styling(kable(BF9[1:12],  align="c", format = "html", caption = "<left>GMC_A and MTLD</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF10 <- describe_posterior(correlationBF(data$GMC_A, data$TTR))
kable_styling(kable(BF10[1:12],  align="c", format = "html", caption = "<left>GMC_A and TTR</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF11 <- describe_posterior(correlationBF(data$GMC_A, data$H))
kable_styling(kable(BF11[1:12],  align="c", format = "html", caption = "<left>GMC_A and H</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF12 <- describe_posterior(correlationBF(data$GMC_A, data$IWI))
kable_styling(kable(BF12[1:12],  align="c", format = "html", caption = "<left>GMC_A and IWI</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF13 <- describe_posterior(correlationBF(data$GMC_A, data$LEX))
kable_styling(kable(BF13[1:12],  align="c", format = "html", caption = "<left>GMC_A and LEX</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF14 <- describe_posterior(correlationBF(data$WID, data$MTLD))
kable_styling(kable(BF14[1:12],  align="c", format = "html", caption = "<left>WID and MTLD</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF15 <- describe_posterior(correlationBF(data$WID, data$TTR))
kable_styling(kable(BF15[1:12],  align="c", format = "html", caption = "<left>WID and TTR</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF16 <- describe_posterior(correlationBF(data$WID, data$H))
kable_styling(kable(BF16[1:12],  align="c", format = "html", caption = "<left>WID and H</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF17 <- describe_posterior(correlationBF(data$WID, data$IWI))
kable_styling(kable(BF17[1:12],  align="c", format = "html", caption = "<left>WID and IWI</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF18 <- describe_posterior(correlationBF(data$WID, data$LEX))
kable_styling(kable(BF18[1:12],  align="c", format = "html", caption = "<left>WID and LEX</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF19 <- describe_posterior(correlationBF(data$MTLD, data$TTR))
kable_styling(kable(BF19[1:12],  align="c", format = "html", caption = "<left>MTLD and TTR</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF20 <- describe_posterior(correlationBF(data$MTLD, data$H))
kable_styling(kable(BF20[1:12],  align="c", format = "html", caption = "<left>MTLD and H</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF21 <- describe_posterior(correlationBF(data$MTLD, data$IWI))
kable_styling(kable(BF21[1:12],  align="c", format = "html", caption = "<left>MTLD and IWI</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF22 <- describe_posterior(correlationBF(data$MTLD, data$LEX))
kable_styling(kable(BF22[1:12],  align="c", format = "html", caption = "<left>MTLD and LEX</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF23 <- describe_posterior(correlationBF(data$TTR, data$H))
kable_styling(kable(BF23[1:12],  align="c", format = "html", caption = "<left>TTR and H</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF24 <- describe_posterior(correlationBF(data$TTR, data$IWI))
kable_styling(kable(BF24[1:12],  align="c", format = "html", caption = "<left>TTR and IWI</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF25 <- describe_posterior(correlationBF(data$TTR, data$LEX))
kable_styling(kable(BF25[1:12],  align="c", format = "html", caption = "<left>TTR and LEX</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF26 <- describe_posterior(correlationBF(data$H, data$IWI))
kable_styling(kable(BF26[1:12],  align="c", format = "html", caption = "<left>H and IWI</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF27 <- describe_posterior(correlationBF(data$H, data$LEX))
kable_styling(kable(BF27[1:12],  align="c", format = "html", caption = "<left>H and LEX</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)

BF28 <- describe_posterior(correlationBF(data$IWI, data$LEX))
kable_styling(kable(BF28[1:12],  align="c", format = "html", caption = "<left>IWI and LEX</left>", escape = FALSE, booktabs = TRUE), bootstrap_options = c("striped", "hover","responsive", "condensed"), position="center", font_size = 12)
The 28 pairwise results are gathered in the table below. For every pair, CI = 0.95 and ROPE = [-0.05, 0.05].

| Pair | Median (rho) | CI_low | CI_high | pd | ROPE_Percentage | log_BF | BF |
|---|---|---|---|---|---|---|---|
| GMC_W and GMC_A | 0.368084 | 0.1222868 | 0.5813985 | 0.9985 | 0 | 2.579227 | 13.18694 |
| GMC_W and WID | 0.4629212 | 0.2357978 | 0.6500067 | 0.99975 | 0 | 4.925697 | 137.7853 |
| GMC_W and MTLD | 0.2875769 | 0.0148651 | 0.52473 | 0.98075 | 0.0155263 | 1.110483 | 3.035824 |
| GMC_W and TTR | 0.5526638 | 0.2857989 | 0.7180013 | 1 | 0 | 8.164535 | 3514.088 |
| GMC_W and H | 0.5791322 | 0.3216001 | 0.7323639 | 1 | 0 | 8.95151 | 7719.543 |
| GMC_W and IWI | -0.5569306 | -0.7205411 | -0.3398518 | 1 | 0 | 8.197116 | 3630.466 |
| GMC_W and LEX | -0.3596098 | -0.5732157 | -0.0992704 | 0.99825 | 0 | 2.359017 | 10.58055 |
| GMC_A and WID | 0.1262596 | -0.1357893 | 0.3799691 | 0.8195 | 0.1992105 | -0.7016012 | 0.4957908 |
| GMC_A and MTLD | 0.0479216 | -0.2243636 | 0.3112007 | 0.63325 | 0.2778947 | -1.0584 | 0.3470107 |
| GMC_A and TTR | 0.2756988 | 0.0012077 | 0.5065698 | 0.97525 | 0.0236842 | 0.9682375 | 2.633299 |
| GMC_A and H | 0.2496223 | -0.0158695 | 0.4929818 | 0.96875 | 0.0436842 | 0.6103013 | 1.840986 |
| GMC_A and IWI | -0.1743291 | -0.4147633 | 0.1108113 | 0.88925 | 0.1475927 | -0.3197712 | 0.7263152 |
| GMC_A and LEX | -0.0923351 | -0.3533555 | 0.1736132 | 0.739 | 0.2457895 | -0.8804147 | 0.4146109 |
| WID and MTLD | 0.7509852 | 0.5982481 | 0.8499247 | 1 | 0 | 18.55596 | 114485254 |
| WID and TTR | 0.8918917 | 0.8079189 | 0.936398 | 1 | 0 | 34.84776 | 1.362039e+15 |
| WID and H | 0.7784292 | 0.6294073 | 0.8676515 | 1 | 0 | 20.82926 | 1111816653 |
| WID and IWI | -0.793148 | -0.8786097 | -0.6651088 | 1 | 0 | 22.55397 | 6238269905 |
| WID and LEX | -0.2400747 | -0.4763304 | 0.0402882 | 0.9595 | 0.0576316 | 0.4372291 | 1.548411 |
| MTLD and TTR | 0.771175 | 0.6221074 | 0.8639798 | 1 | 0 | 20.34894 | 687755050 |
| MTLD and H | 0.587658 | 0.3755961 | 0.7418451 | 1 | 0 | 9.38867 | 11952.2 |
| MTLD and IWI | -0.4313651 | -0.6335487 | -0.1855626 | 0.9975 | 0 | 4.17146 | 64.81 |
| MTLD and LEX | -0.2339513 | -0.4681838 | 0.0356418 | 0.95525 | 0.0663158 | 0.3370314 | 1.400783 |
| TTR and H | 0.8902897 | 0.8234914 | 0.9385917 | 1 | 0 | 34.69387 | 1.16777e+15 |
| TTR and IWI | -0.709828 | -0.8233666 | -0.5386456 | 1 | 0 | 15.74256 | 6869211 |
| TTR and LEX | -0.3170426 | -0.5420163 | -0.0624602 | 0.99025 | 0 | 1.607086 | 4.988256 |
| H and IWI | -0.7055411 | -0.8242122 | -0.5420792 | 1 | 0 | 15.68762 | 6501977 |
| H and LEX | -0.2625385 | -0.4950762 | -0.0032628 | 0.977 | 0.0292105 | 0.7467516 | 2.110134 |
| IWI and LEX | 0.5154311 | 0.2938463 | 0.6907787 | 1 | 0 | 6.707363 | 818.4099 |

5 Beyond word complexity

5.1 Inter-Word Information (IWI)

Inter-Word Information (IWI) estimates the amount of information carried across words by measuring the average compression ratio in each language L, i.e., the change in size of the compressed text file for language L before and after word order is distorted by random permutation.
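As a minimal sketch of this logic outside the R pipeline (using Python's `zlib` instead of the zip archives of the R code below, and a toy text rather than a Bible corpus), the raw compression ratio can be computed as:

```python
import random
import zlib

def iwi_raw(text: str, n_perm: int = 10, seed: int = 0) -> float:
    """Mean of 1 - compressed(original)/compressed(shuffled) over n_perm shuffles."""
    words = text.split()
    original_size = len(zlib.compress(" ".join(words).encode("utf-8")))
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_perm):
        shuffled = words[:]
        rng.shuffle(shuffled)
        shuffled_size = len(zlib.compress(" ".join(shuffled).encode("utf-8")))
        ratios.append(1 - original_size / shuffled_size)
    return sum(ratios) / len(ratios)

# A repetitive, highly ordered text compresses much better in its original
# order than after shuffling, so its raw compression ratio is positive.
text = "the cat sat on the mat " * 50
print(iwi_raw(text) > 0)
```

In the full pipeline, each language's mean ratio is further normalized by the English value to obtain IWI.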

# Load a list of bible text file names and language names
listBible <- data[,c("Language","BibleFile")]

# Calculate IWI by using the list of bible text file names and language names
meanRatioList <- list()
for(i in 1:nrow(listBible)){
  bibleTxt <- paste0("./", listBible$BibleFile[i])
  Language <- listBible$Language[i]
  # Read the text once per language and compress the original word order
  bible <- read_delim(bibleTxt, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  bibleText <- paste(unlist(strsplit(bible$X1, " ")), collapse = " ")
  write.table(bibleText, "original.txt", quote = FALSE, col.names = FALSE, row.names = FALSE, sep = '\t')
  zip(zipfile = "original.zip", files = "original.txt")
  filesize <- file.info("original.zip")$size
  # Compare against 10 random permutations of word order
  ratio <- list()
  for(j in 1:10){
    set.seed(j * 10)
    bibleTextRandom <- paste(sample(unlist(strsplit(bibleText, " "))), collapse = " ")
    write.table(bibleTextRandom, "random.txt", quote = FALSE, col.names = FALSE, row.names = FALSE, sep = '\t')
    zip(zipfile = "random.zip", files = "random.txt")
    randomFilesize <- file.info("random.zip")$size
    ratio[[j]] <- 1 - filesize / randomFilesize
  }
  meanRatio <- mean(unlist(ratio))
  meanRatioList[[i]] <- c(Language, meanRatio)
}
meanRatioListT <- as.data.frame(do.call(rbind, meanRatioList))
colnames(meanRatioListT) <- c("Language", "MeanC")
# Normalize by the English mean compression ratio to obtain IWI
meanCEng <- as.numeric(meanRatioListT[meanRatioListT$Language == "English",]$MeanC)
meanRatioListT$IWI <- as.numeric(meanRatioListT$MeanC) / meanCEng
meanRatioListT$MeanC <- NULL
Inter-Word Information (IWI). On the x-axis, languages are ordered by increasing IWI values from left to right. Marker convention is shown in Figure 1b (also in Supplementary Information (§8.1)).

5.1.1 IWI & WID

Distribution of the languages according to their information encoding strategies along the Inter-Word Information and Word Information Density (IWI and WID, respectively) dimensions. Marker convention is shown in Figure 1b (also in Supplementary Information (§8.1)).

5.2 Language EXplicitness (LEX)

Language EXplicitness (LEX) quantifies the amount of information explicitly encoded by a language. It is calculated as the product of WID and IWI, which estimate the amount of information along the within-word and across-word dimensions, respectively, relative to English.
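Since the definition is simply a product of the two normalized metrics, the computation reduces to one line per language. The sketch below uses made-up WID and IWI values purely for illustration; the real values come from the code in the preceding sections.

```python
# Hypothetical WID and IWI values (both normalized so that English = 1.0);
# LangA and LangB are illustrative placeholders, not languages of the dataset.
metrics = {
    "English": {"WID": 1.00, "IWI": 1.00},
    "LangA":   {"WID": 1.20, "IWI": 0.90},
    "LangB":   {"WID": 0.80, "IWI": 1.10},
}
lex = {lang: m["WID"] * m["IWI"] for lang, m in metrics.items()}
print(lex["English"])  # 1.0 by construction: English is its own reference
```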

Languages ranked by increasing LEX (Language EXplicitness). Marker convention is shown in Figure 1b.

5.2.1 LEX with respect to macroarea, phonological and morphological variables

In this section, languages ranked by increasing LEX are displayed with respect to macroarea, phonological variables (Syllable Structure and Tonal System) and morphological variables (Morphological Strategy, Fusion, and Exponence).

Phonological complexity is estimated in this study by the degree of syllable complexity obtained from WALS (Maddieson 2013; see also Easterday et al. 2021). In WALS, languages are classified into three categories according to their maximal syllable structure: (i) Simple: (C)V, (ii) Moderately complex: (C)(C)V(C), (iii) Complex: (C)(C)(C)V(C)(C)(C)(C). Missing values for Barasano and Khalkha were completed by the authors (Jones & Jones 1991; Svantesson 1994). Among the 47 languages of the dataset, 9 belong to the Simple category (19%), 25 to the Moderately complex category (53%), and 13 to the Complex category (28%). These indices are not directly integrated in the study but are provided as additional background to allow further interpretation.
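The percentages quoted above follow directly from the category counts; as a quick check (a plain Python sketch, not part of the R pipeline):

```python
# Syllable-structure category counts for the 47-language dataset
counts = {"Simple": 9, "Moderately complex": 25, "Complex": 13}
total = sum(counts.values())  # 47 languages
shares = {k: round(100 * v / total) for k, v in counts.items()}
print(total, shares)  # 47 {'Simple': 19, 'Moderately complex': 53, 'Complex': 28}
```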

6 Breaking the parallel corpus barrier: a proof of concept

The code below describes how Word Information Density (WID) is estimated from four different corpus configurations: 1) WID (Whole Parallel corpus, 1,150 verses), 2) WID_FP (Full Parallel corpus, 20 subsets), 3) WID_PP (Pairwise Parallel corpus, 20 permutations), and 4) WID_NP (Non-Parallel corpus, 20 permutations).

# 1) Calculate WID (Whole Parallel corpus, 1,150 verses)
# Load the list of bible text file names and language names, and count the number of words in each English verse (the Parallel Bible Corpus can be downloaded from http://www.christianbentz.de/MLC2019_data.html)
listBible <- data[,c("Language","BibleFile")]
bibleENG <- listBible[listBible$Language=="English",]$BibleFile
bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
wordCount <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
colnames(wordCount)[1] <- "ENG"
wordCount$ID <- seq_along(wordCount$ENG)

# Function to calculate Word Information Density (WID), using English as a reference
 wordCountCal <- function(bible, Language){
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
  colnames(bible)[1] <- Language
  bible$ID <- seq_along(bible[,1])
  wordCount <- merge(wordCount,bible, by="ID")
  WID <- sum(wordCount$ENG)/sum(wordCount[,ncol(wordCount)])
  langWID <- c(Language, WID)
  return(langWID)
 }

# Calculate WID by using the list of bible text file names and language codes
widLists <- list()
for(i in 1:nrow(listBible)){
 bible <- paste0("./",listBible$BibleFile[i])
 Language <- listBible$Language[i]
 langWID <- wordCountCal(bible, Language)  # call once: returns c(Language, WID)
 widLists[[i]] <- cbind(langWID[1], langWID[2])
}
widLists <- as.data.frame(do.call(rbind, widLists))
colnames(widLists) <- c("Language","WID")
widLists$Configuration <- "WID"
widLists <- widLists[order(widLists$Language),]
widLists$ID <- seq_along(widLists$WID)

# Add WID rank for each language
widLists <- widLists[order(widLists$WID, decreasing=TRUE),]
widLists$Rank <- seq_along(widLists$WID)
widLists$MeanWID <- widLists$WID # single whole-corpus estimate, so MeanWID equals WID
widLists <- widLists[,c("Language","Configuration","WID","MeanWID","Rank","ID")]
widLists <- widLists[order(widLists$ID, decreasing=FALSE),]

# 2) Calculate WID_FP with 20 subsets (57 verses per subset) 
# Load a list of bible text file names and language codes and count the number of words in each subset in English
listBible <- data[,c("Language","BibleFile")]
bibleENG <- listBible[listBible$Language=="English",]$BibleFile
bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
wordCount <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
colnames(wordCount)[1] <- "ENG"
wordCount$ID <- seq_along(wordCount$ENG)

# Function to calculate WID_FP, using English as a reference
wid_FPCal <- function(bible, Language){
 bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
 bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
 colnames(bible)[1] <- Language
 bible$ID <- seq_along(bible[,1])
 wordCount <- merge(wordCount, bible, by="ID")
 wid <- list()
 for(j in 1:20){
    k <- 57*(j-1) + 1
    l <- 57*j
    wid[[j]] <- sum(wordCount$ENG[k:l])/sum(wordCount[,ncol(wordCount)][k:l])
 }
langWid_FP <- c(Language, wid)
return(langWid_FP)
}

# Calculate WID_FP by using the list of bible text file names and language codes
wid_FPList <- list()
wid_FPLists <- list()
for(i in 1:nrow(listBible)){
  bible <- paste0("./",listBible$BibleFile[i])
  Language <- listBible$Language[i] 
  wid_FPList[[i]] <- wid_FPCal(bible,Language)[2:21]
  wid_FPLists[[i]] <- cbind(Language, unlist(wid_FPList[[i]]))
}
wid_FPLists <- as.data.frame(do.call(rbind, wid_FPLists))
colnames(wid_FPLists) <- c("Language","WID")
wid_FPLists$Configuration <- "WID_FP"
wid_FPLists <- wid_FPLists[order(wid_FPLists$Language),]
wid_FPLists$ID <- seq_along(wid_FPLists$WID)

# Compute average WID_FP
wid_FPLists$WID <- as.numeric(as.character(wid_FPLists$WID))
meanWID_FP <- wid_FPLists %>% 
  group_by(Language) %>% 
  summarise(MeanWID = mean(WID))
meanWID_FP <- meanWID_FP[order(meanWID_FP$MeanWID, decreasing=TRUE),]
meanWID_FP$Rank <- seq_along(meanWID_FP$MeanWID) 

wid_FPLists <- merge(wid_FPLists, meanWID_FP, by="Language")
wid_FPLists <- wid_FPLists[,c("Language","Configuration","WID","MeanWID","Rank","ID")]
wid_FPLists <- wid_FPLists[order(wid_FPLists$ID, decreasing=FALSE),]

# 3) Calculate WID_PP with 47 subsets (24 verses per subset)
# Randomly split corpus into 47 subsets 
listBible <- data[,c("Language","BibleFile")]
rownames(listBible) <- unlist(listBible$Language)
listBible <- listBible[c("Amele","Alamblak","Arapesh (Mountain)","Apurinã","Mapudungun","Arabic (Egyptian)", "Barasano","Chamorro","German","Daga","Greek (Modern)","English","Basque","Fijian","Finnish","French", "Guaraní","Oromo (Harar)","Hausa","Hindi","Indonesian","Jakaltek","Greenlandic (West)","Georgian","Kewa", "Khalkha","Korean","Lango","Mixtec (Chalcatongo)","Burmese","Wichí","Khoekhoe","Persian","Malagasy","Quechua (Imbabura)","Russian","Sango","Spanish","Swahili","Tagalog","Thai","Turkish","Vietnamese","Sanumá","Yagua","Yaqui","Yoruba"),]
bibleENG <- listBible[listBible$Language=="English",]$BibleFile
bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
corpusSize <- floor(nrow(bibleENG)/47) # 24 verses per subset
colnames(bibleENG)[1] <- "ENG"
bibleENG$ID <- seq_along(bibleENG$ENG)
corpusSubsetID <- list()
for(i in 1:47){
  j <- i*10
  set.seed(j)
  randomSeq <- sample(seq_len(nrow(bibleENG)), size = corpusSize)
  corpusSubsetID[[i]] <- c(bibleENG[randomSeq,]$ID)
  bibleENG <- bibleENG[-randomSeq,]
}
corpusSubsetIDList <- as.data.frame(do.call(cbind, corpusSubsetID))
colnames(corpusSubsetIDList) <- c(1:47)

# Randomly permute the subsets assigned to the 47 languages (20 times)
corpusSubsetRan20 <- list()
for(i in 1:20){
  j <- i*10
  set.seed(j)
  corpusSubsetRan20[[i]] <- sample(corpusSubsetIDList)
}
bibleENG <- listBible[listBible$Language=="English",]$BibleFile
bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
wordCount <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
colnames(wordCount)[1] <- "ENG"
wordCount$ID <- seq_along(wordCount$ENG)

# Function to calculate WID_PP, using English as a reference
wid_PPCal <- function(subsetList){
  langWid <- list()
  for(j in 1:47){
  Language <- listBible$Language[j]  
  bible <- paste0("./",listBible$BibleFile[j])
  bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
  colnames(bible)[1] <- Language
  bible$ID <- seq_along(bible[,1])
  subset <- unlist(subsetList[j])
  dataSubset <- wordCount[wordCount$ID %in% subset,]
  dataSubset <- merge(dataSubset, bible, by="ID")
  wid <- sum(dataSubset$ENG)/sum(dataSubset[,ncol(dataSubset)])
  langWid[[j]]  <- c(Language, wid)
  } 
  return(langWid)
}

# Calculate WID_PP by using the list of randomly permuted subsets assigned to the 47 languages
wid_PPList <- list() 
for(i in 1:20){  
  subsetList <- corpusSubsetRan20[[i]]
  wid <- wid_PPCal(subsetList)
  wid_PPList[[i]] <- as.data.frame(do.call(rbind, wid))
}
wid_PPLists <- as.data.frame(do.call(rbind, wid_PPList))
colnames(wid_PPLists) <- c("Language","WID")
wid_PPLists$Configuration <- "WID_PP"
wid_PPLists <- wid_PPLists[order(wid_PPLists$Language),]
wid_PPLists$ID <- seq_along(wid_PPLists$WID)

# Compute average WID_PP
wid_PPLists$WID <- as.numeric(as.character(wid_PPLists$WID))
meanWID_PP <- wid_PPLists %>% 
  group_by(Language) %>% 
  summarise(MeanWID = mean(WID))
meanWID_PP <- meanWID_PP[order(meanWID_PP$MeanWID, decreasing=TRUE),]
meanWID_PP$Rank <- seq_along(meanWID_PP$MeanWID) 

wid_PPLists <- merge(wid_PPLists, meanWID_PP, by="Language")
wid_PPLists <- wid_PPLists[,c("Language","Configuration","WID","MeanWID","Rank","ID")]
wid_PPLists <- wid_PPLists[order(wid_PPLists$ID, decreasing=FALSE),]

# 4) Calculate WID_NP with 47 subsets (24 verses per subset)
# Load English surprisal estimated at the verse level 
surprisal <- read_delim("./surprisal.txt","\t", escape_double = FALSE, trim_ws = TRUE)
surprisal <- as.data.frame(as.numeric(surprisal$Surprisal))
colnames(surprisal)[1] <- "surprisalENG"
surprisal$ID <- seq_along(surprisal$surprisalENG)
surprisal$surprisalENG <-  surprisal$surprisalENG*-1

# Function to calculate WID_NP, by using English surprisal
wid_NPCal <- function(subsetList){
  langWid_NP <- list()
  for(j in 1:47){
    Language <- listBible$Language[j]  
    bible <- paste0("./",listBible$BibleFile[j])
    bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
    bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
    colnames(bible)[1] <- Language
    bible$ID <- seq_along(bible[,1])
    subset <- unlist(subsetList[j])
    dataSubset <- surprisal[surprisal$ID %in% subset,]
    dataSubset <- merge(dataSubset, bible, by="ID")
    bibleENG <- listBible[listBible$Language=="English",]$BibleFile
    bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
    bibleENG <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
    colnames(bibleENG)[1] <- "English"
    bibleENG$ID <- seq_along(bibleENG[,1])
    subsetENG <- unlist(subsetList[12]) # English is the 12th row of listBible
    dataSubsetENG <- surprisal[surprisal$ID %in% subsetENG,]
    dataSubsetENG <- merge(dataSubsetENG, bibleENG, by="ID")
    wid_NP <- (sum(dataSubset$surprisalENG)/sum(dataSubset[,ncol(dataSubset)]))*(sum(dataSubsetENG$English)/sum(dataSubsetENG$surprisalENG))
    langWid_NP[[j]]  <- c(Language, wid_NP)
  } 
  return(langWid_NP)
}

# Calculate WID_NP by using the list of randomly permuted subsets assigned to the 47 languages
wid_NPList <- list() 
for(i in 1:20){  
  subsetList <- corpusSubsetRan20[[i]]
  wid_NP <- wid_NPCal(subsetList)
  wid_NPList[[i]] <- as.data.frame(do.call(rbind, wid_NP))
}
wid_NPLists <- as.data.frame(do.call(rbind, wid_NPList))
colnames(wid_NPLists) <- c("Language","WID")
wid_NPLists$Configuration <- "WID_NP"
wid_NPLists <- wid_NPLists[order(wid_NPLists$Language),]
wid_NPLists$ID <- seq_along(wid_NPLists$WID)

# Compute average WID_NP
wid_NPLists$WID <- as.numeric(as.character(wid_NPLists$WID))
meanWID_NP <- wid_NPLists %>% 
  group_by(Language) %>% 
  summarise(MeanWID = mean(WID))
meanWID_NP <- meanWID_NP[order(meanWID_NP$MeanWID, decreasing=TRUE),]
meanWID_NP$Rank <- seq_along(meanWID_NP$MeanWID) 

wid_NPLists <- merge(wid_NPLists, meanWID_NP, by="Language")
wid_NPLists <- wid_NPLists[,c("Language","Configuration","WID","MeanWID","Rank","ID")]
wid_NPLists <- wid_NPLists[order(wid_NPLists$ID, decreasing=FALSE),]

# Combine all lists of WID configurations
allLists <- rbind(widLists, wid_FPLists, wid_PPLists, wid_NPLists)

6.1 Average WID by different corpus configurations

The distribution of average WID estimated from three different corpus configurations (WID_FP, WID_PP, and WID_NP) is displayed.

Languages ranked by WID (x-axis). Each panel corresponds to a different corpus sampling configuration. In each panel, languages are ranked by average WID over the subsets, potentially leading to differences in ranking across the panels.

6.2 Rank of average WID by different corpus configurations

The average WID ranks assessed by using different corpus configurations are compared: WID (Whole parallel corpus), average WID_FP (Full Parallel corpus), WID_PP (Pairwise Parallel corpus), and WID_NP (Non-Parallel corpus).
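The ranking convention used throughout (rank 1 = highest average WID, mirroring the decreasing sort in the R code) can be sketched as follows, with hypothetical mean WID values for illustration:

```python
# Hypothetical mean WID values (English = 1.0 by construction);
# LangA, LangB, LangC are illustrative placeholders.
mean_wid = {"LangA": 1.31, "English": 1.00, "LangB": 0.74, "LangC": 1.05}
# Sort languages by decreasing mean WID and assign 1-based ranks
ranked = sorted(mean_wid, key=mean_wid.get, reverse=True)
ranks = {lang: i + 1 for i, lang in enumerate(ranked)}
print(ranks)  # {'LangA': 1, 'LangC': 2, 'English': 3, 'LangB': 4}
```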

Comparison of the information density computed on the whole corpus (WID, left column in each panel) and the information densities implemented following Figure 13 (right column in each panel).

6.3 Comparing information density estimations from parallel and non-parallel corpora

6.3.1 Correlation between Word Information Density estimations

Comparison between Word Information Densities estimated from Full Parallel (WID_FP), Pairwise Parallel (WID_PP), and Non-Parallel (WID_NP) corpora.

6.3.2 Randomized permutation test for the correlation between WID_PP and WID_NP

The code below computes correlation coefficients between WID_PP and WID_NP over 1,000 randomized permutations of the English surprisal values associated with each subset (20 subsets per language).

# Calculate WID_PP and WID_NP (with randomized surprisal)
# Randomly split corpus into 47 subsets (24 verses per subset)
listBible <- data[,c("Language","BibleFile")]
rownames(listBible) <- unlist(listBible$Language)
listBible <- listBible[c("Amele","Alamblak","Arapesh (Mountain)","Apurinã","Mapudungun","Arabic (Egyptian)", "Barasano","Chamorro","German","Daga","Greek (Modern)","English","Basque","Fijian","Finnish","French", "Guaraní","Oromo (Harar)","Hausa","Hindi","Indonesian","Jakaltek","Greenlandic (West)","Georgian","Kewa", "Khalkha","Korean","Lango","Mixtec (Chalcatongo)","Burmese","Wichí","Khoekhoe","Persian","Malagasy","Quechua (Imbabura)","Russian","Sango","Spanish","Swahili","Tagalog","Thai","Turkish","Vietnamese","Sanumá","Yagua","Yaqui","Yoruba"),]
bibleENG <- listBible[listBible$Language=="English",]$BibleFile
bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
corpusSize <- floor(nrow(bibleENG)/47) # 24 verses per subset
colnames(bibleENG)[1] <- "ENG"
bibleENG$ID <- seq_along(bibleENG$ENG)
corpusSubsetID <- list()
for(i in 1:47){
  j <- i*10
  set.seed(j)
  randomSeq <- sample(seq_len(nrow(bibleENG)), size = corpusSize)
  corpusSubsetID[[i]] <- c(bibleENG[randomSeq,]$ID)
  bibleENG <- bibleENG[-randomSeq,]
}
corpusSubsetIDList <- as.data.frame(do.call(cbind, corpusSubsetID))
colnames(corpusSubsetIDList) <- c(1:47)

# Randomly permute the subsets assigned to the 47 languages (20 times)
corpusSubsetRan20 <- list()
for(i in 1:20){
  j <- i*10
  set.seed(j)
  corpusSubsetRan20[[i]] <- sample(corpusSubsetIDList)
}

# Calculate WID_PP and WID_NP with randomized surprisal and obtain Spearman's correlation coefficient between WID_PP and WID_NP (1,000 times)
estimate <- list()
for(k in 1:1000){
  surprisal <- read_delim("./surprisal.txt","\t", escape_double = FALSE, trim_ws = TRUE)
  set.seed(k*100)
  surprisal <- sample(surprisal$Surprisal, length(surprisal$Surprisal))
  surprisal <- as.data.frame(as.numeric(surprisal))
  colnames(surprisal)[1] <- "surprisalENG"
  surprisal$ID <- seq_along(surprisal$surprisalENG)
  surprisal$surprisalENG <-  surprisal$surprisalENG*-1
  langWid_NP <- list()
  wid_NPCal <- function(subsetList){
    for(j in 1:47){
      Language <- listBible$Language[j]  
      bible <- paste0("./",listBible$BibleFile[j])
      bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
      bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
      colnames(bible)[1] <- Language
      bible$ID <- seq_along(bible[,1])
      subset <- unlist(subsetList[j])
      dataSubset <- surprisal[surprisal$ID %in% subset,]
      dataSubset <- merge(dataSubset, bible, by="ID")
      bibleENG <- listBible[listBible$Language=="English",]$BibleFile
      bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
      bibleENG <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
      colnames(bibleENG)[1] <- "English"
      bibleENG$ID <- seq_along(bibleENG[,1])
      subsetENG <- unlist(subsetList[12]) # English is the 12th row of listBible
      dataSubsetENG <- surprisal[surprisal$ID %in% subsetENG,]
      dataSubsetENG <- merge(dataSubsetENG, bibleENG, by="ID")
      wid_NP <- (sum(dataSubset$surprisalENG)/sum(dataSubset[,ncol(dataSubset)]))*(sum(dataSubsetENG$English)/sum(dataSubsetENG$surprisalENG))
      langWid_NP[[j]]  <- c(Language, wid_NP)
    } 
    return(langWid_NP)
  }
  wid_NPList <- list() 
  for(i in 1:20){  
    subsetList <- corpusSubsetRan20[[i]]
    wid_NP <- wid_NPCal(subsetList)
    wid_NPList[[i]] <- as.data.frame(do.call(rbind, wid_NP))
  }
  wid_NPLists <- as.data.frame(do.call(rbind, wid_NPList))
  colnames(wid_NPLists) <- c("Language","WID_NP")
  wid_NPLists <- wid_NPLists[order(wid_NPLists$Language),]
  
  bibleENG <- listBible[listBible$Language=="English",]$BibleFile
  bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  wordCount <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
  colnames(wordCount)[1] <- "ENG"
  wordCount$ID <- seq_along(wordCount$ENG)
  langWid <- list()
  wid_PPCal <- function(subsetList){
    for(j in 1:47){
      Language <- listBible$Language[j]  
      bible <- paste0("./",listBible$BibleFile[j])
      bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
      bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
      colnames(bible)[1] <- Language
      bible$ID <- seq_along(bible[,1])
      subset <- unlist(subsetList[j])
      dataSubset <- wordCount[wordCount$ID %in% subset,]
      dataSubset <- merge(dataSubset, bible, by="ID")
      wid <- sum(dataSubset$ENG)/sum(dataSubset[,ncol(dataSubset)])
      langWid[[j]]  <- c(Language, wid)
    } 
    return(langWid)
  }
  wid_PPList <- list() 
  for(i in 1:20){  
    subsetList <- corpusSubsetRan20[[i]]
    wid <- wid_PPCal(subsetList)
    wid_PPList[[i]] <- as.data.frame(do.call(rbind, wid))
  }
  wid_PPLists <- as.data.frame(do.call(rbind, wid_PPList))
  colnames(wid_PPLists) <- c("Language","WID_PP")
  wid_PPLists <- wid_PPLists[order(wid_PPLists$Language),]
  widLists <- cbind(wid_PPLists, wid_NPLists)
  corTest <- cor.test(as.numeric(widLists$WID_PP), as.numeric(widLists$WID_NP), method="spearman",exact=FALSE)
  estimate[[k]] <- corTest$estimate
}
rList <- unlist(estimate)

# Generate the dataset from the first randomization of surprisal
# Randomly split corpus into 47 subsets 
listBible <- data[,c("Language","BibleFile")]
rownames(listBible) <- unlist(listBible$Language)
listBible <- listBible[c("Amele","Alamblak","Arapesh (Mountain)","Apurinã","Mapudungun","Arabic (Egyptian)", "Barasano","Chamorro","German","Daga","Greek (Modern)","English","Basque","Fijian","Finnish","French", "Guaraní","Oromo (Harar)","Hausa","Hindi","Indonesian","Jakaltek","Greenlandic (West)","Georgian","Kewa", "Khalkha","Korean","Lango","Mixtec (Chalcatongo)","Burmese","Wichí","Khoekhoe","Persian","Malagasy","Quechua (Imbabura)","Russian","Sango","Spanish","Swahili","Tagalog","Thai","Turkish","Vietnamese","Sanumá","Yagua","Yaqui","Yoruba"),]
bibleENG <- listBible[listBible$Language=="English",]$BibleFile
bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
corpusSize <- floor(nrow(bibleENG)/47) # 24 verses per subset
colnames(bibleENG)[1] <- "ENG"
bibleENG$ID <- seq_along(bibleENG$ENG)
corpusSubsetID <- list()
for(i in 1:47){
  j <- i*10
  set.seed(j)
  randomSeq <- sample(seq_len(nrow(bibleENG)), size = corpusSize)
  corpusSubsetID[[i]] <- c(bibleENG[randomSeq,]$ID)
  bibleENG <- bibleENG[-randomSeq,]
}
corpusSubsetIDList <- as.data.frame(do.call(cbind, corpusSubsetID))
colnames(corpusSubsetIDList) <- c(1:47)

# Randomly permute the subsets assigned to the 47 languages (20 times)
corpusSubsetRan20 <- list()
for(i in 1:20){
  j <- i*10
  set.seed(j)
  corpusSubsetRan20[[i]] <- sample(corpusSubsetIDList)
}

# Calculate WID_PP and WID_NP with randomized surprisal (first randomization only, k = 1)
for(k in 1:1){
  surprisal <- read_delim("./surprisal.txt","\t", escape_double = FALSE, trim_ws = TRUE)
  set.seed(k*100)
  surprisal <- sample(surprisal$Surprisal, length(surprisal$Surprisal))
  surprisal <- as.data.frame(as.numeric(surprisal))
  colnames(surprisal)[1] <- "surprisalENG"
  surprisal$ID <- seq_along(surprisal$surprisalENG)
  surprisal$surprisalENG <-  surprisal$surprisalENG*-1
  langWid_NP <- list()
  wid_NPCal <- function(subsetList){
    for(j in 1:47){
      Language <- listBible$Language[j]  
      bible <- paste0("./",listBible$BibleFile[j])
      bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
      bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
      colnames(bible)[1] <- Language
      bible$ID <- seq_along(bible[,1])
      subset <- unlist(subsetList[j])
      dataSubset <- surprisal[surprisal$ID %in% subset,]
      dataSubset <- merge(dataSubset, bible, by="ID")
      bibleENG <- listBible[listBible$Language=="English",]$BibleFile
      bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
      bibleENG <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
      colnames(bibleENG)[1] <- "English"
      bibleENG$ID <- seq_along(bibleENG[,1])
      subsetENG <- unlist(subsetList[12]) # English is the 12th row of listBible
      dataSubsetENG <- surprisal[surprisal$ID %in% subsetENG,]
      dataSubsetENG <- merge(dataSubsetENG, bibleENG, by="ID")
      wid_NP <- (sum(dataSubset$surprisalENG)/sum(dataSubset[,ncol(dataSubset)]))*(sum(dataSubsetENG$English)/sum(dataSubsetENG$surprisalENG))
      langWid_NP[[j]]  <- c(Language, wid_NP)
    } 
    return(langWid_NP)
  }
  wid_NPList <- list() 
  for(i in 1:20){  
    subsetList <- corpusSubsetRan20[[i]]
    wid_NP <- wid_NPCal(subsetList)
    wid_NPList[[i]] <- as.data.frame(do.call(rbind, wid_NP))
  }
  wid_NPLists <- as.data.frame(do.call(rbind, wid_NPList))
  colnames(wid_NPLists) <- c("Language","WID_NP")
  wid_NPLists <- wid_NPLists[order(wid_NPLists$Language),]
  
  bibleENG <- listBible[listBible$Language=="English",]$BibleFile
  bibleENG <- read_delim(bibleENG, "\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
  wordCount <- as.data.frame(sapply(strsplit(bibleENG$X1, " "), length))
  colnames(wordCount)[1] <- "ENG"
  wordCount$ID <- seq_along(wordCount$ENG)
  langWid <- list()
  wid_PPCal <- function(subsetList){
    for(j in 1:47){
      Language <- listBible$Language[j]  
      bible <- paste0("./",listBible$BibleFile[j])
      bible <- read_delim(bible,"\t", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
      bible <- as.data.frame(sapply(strsplit(bible$X1, " "), length))
      colnames(bible)[1] <- Language
      bible$ID <- seq_along(bible[,1])
      subset <- unlist(subsetList[j])
      dataSubset <- wordCount[wordCount$ID %in% subset,]
      dataSubset <- merge(dataSubset, bible, by="ID")
      wid <- sum(dataSubset$ENG)/sum(dataSubset[,ncol(dataSubset)])
      langWid[[j]]  <- c(Language, wid)
    } 
    return(langWid)
  }
  wid_PPList <- list() 
  for(i in 1:20){  
    subsetList <- corpusSubsetRan20[[i]]
    wid <- wid_PPCal(subsetList)
    wid_PPList[[i]] <- as.data.frame(do.call(rbind, wid))
  }
  wid_PPLists <- as.data.frame(do.call(rbind, wid_PPList))
  colnames(wid_PPLists) <- c("Language","WID_PP")
  wid_PPLists <- wid_PPLists[order(wid_PPLists$Language),]
  wid_PP_NP_Lists <- cbind(wid_PPLists, wid_NPLists)
  wid_PP_NP_Lists[3] <- NULL
  colnames(wid_PP_NP_Lists)[1] <- "Language"
}
Comparison of the global correlation between WID_PP and WID_NP (red vertical line) and the distribution of correlations obtained from 1,000 randomized permutations (histogram in yellow).

6.3.3 Correlation between WID_PP and WID_NP in each language

40 out of 46 languages exhibit a positive within-language correlation between WID_PP and WID_NP.
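The within-language correlations behind this count pair each language's 20 WID_PP estimates with its 20 WID_NP estimates. A self-contained sketch of that computation (Python, with a hand-rolled Spearman's rho that assumes no ties, and hypothetical subset values):

```python
def spearman_rho(x, y):
    """Spearman's rank correlation (assumes no tied values)."""
    n = len(x)
    rx = {v: i + 1 for i, v in enumerate(sorted(x))}  # 1-based ranks of x
    ry = {v: i + 1 for i, v in enumerate(sorted(y))}  # 1-based ranks of y
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical 20-subset WID_PP / WID_NP estimates for one language
wid_pp = [0.8 + 0.01 * i for i in range(20)]
wid_np = [0.7 + 0.012 * i for i in range(20)]
print(spearman_rho(wid_pp, wid_np))  # 1.0: perfectly concordant ranks
```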

6.3.4 Correlation between WID_PP and WID_NP (with randomized permutation) in each language

A majority of languages exhibit a negative or null within-language correlation between WID_PP and WID_NP when the English surprisal values are randomly permuted.

7 References

Bickel, Balthasar & Johanna Nichols. 2013. Chapter 22: Inflectional synthesis of the verb. In Matthew S. Dryer & Martin Haspelmath (eds.). The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info/chapter/22)

Bickel, Balthasar, Johanna Nichols, Taras Zakharko, Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler, Lennart Bierkandt, Fernando Zúñiga & John B. Lowe. 2017. The AUTOTYP Typological Databases. Version 0.1.0. https://github.com/autotyp/autotyp-data/tree/0.1.0

Easterday, Shelece, Matthew Stave, Marc Allassonnière-Tang & Frank Seifart. 2021. Syllable complexity and morphological synthesis: a well-motivated positive complexity correlation across subdomains. Frontiers in Psychology 12, 583.

Gomez-Imbert, Elsa & Michael Kenstowicz. 2000. Barasana tone and accent. International Journal of American Linguistics 66(4), 419-463.

Jones, Wendell & Paula Jones. 1991. Barasano Syntax. (Publications in Linguistics, 101.) Dallas: Summer Institute of Linguistics and The University of Texas at Arlington.

Maddieson, Ian, Sébastien Flavier, Egidio Marsico, Christophe Coupé & François Pellegrino. 2013. LAPSyd: Lyon-Albuquerque phonological systems database. In Proceedings of the 14th Interspeech Conference, Lyon, France.

Maddieson, Ian. 2013. Syllable Structure. In Matthew S. Dryer & Martin Haspelmath (eds.). The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (Available online at http://wals.info/chapter/12)

McCarthy, Philip M. & Scott Jarvis. 2010. MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods 42, 381-392.

Svantesson, Jan-Olof. 1994. Mongolian syllable structure. Working Papers 42, 225-239. Lund University, Department of Linguistics.

8 Paper figures

8.1 Figure 1

Figure 1. a) Geographical distribution. b) Distribution of the languages among WALS classical typological features and symbolic codes. Marker color and shape respectively encode the fusion strategy and the exponence category. Marker size further indicates whether verbal inflection is limited (small size for Low values) or more extended (large size for Mid and High values). In each cell, the number of languages is displayed when different from zero.

8.2 Figure 2

Figure 2. Grammar-based Morphological Complexity based on WALS (GMC_W). On the x-axis, languages are ordered by increasing GMC_W values from left to right. GMC_W is by definition normalized between -1 and 0. Marker convention is shown in Figure 1b (also in Supplementary Information (§8.1)).

8.3 Figure 3

Figure 3. Grammar-based verbal inflectional complexity based on AUTOTYP (GMC_A). On the x-axis, languages are ordered by increasing GMC_A values from left to right. Marker convention is shown in Figure 1b (also in Supplementary Information (§8.1)).

8.4 Figure 4

Figure 4. Language distribution in a two-dimensional space defined by Grammar-based Morphological Complexities based on WALS (GMC_W, x-axis) and AUTOTYP (GMC_A, y-axis). Marker convention is shown in Figure 1b (also in Supplementary Information (§8.1)).

8.5 Figure 5

Figure 5. Languages ranked by Type-Token Ratio (TTR, x-axis). Each panel corresponds to a different corpus sampling configuration, from one unique sample (Whole set, top left panel) to 60 samples (bottom right panel). In each panel, languages are ranked by average TTR over the subsets, potentially leading to differences in ranking across the panels.
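
For readers checking the computations, here is a minimal R sketch of the Type-Token Ratio underlying this figure (not the exact code used in this script, which operates on the corpus subsets described above): TTR is simply the number of distinct word types divided by the number of word tokens.

```r
# Type-Token Ratio of a character vector of word tokens:
# number of distinct types divided by number of tokens.
ttr <- function(tokens) {
  length(unique(tokens)) / length(tokens)
}

ttr(c("the", "cat", "saw", "the", "dog"))  # 4 types / 5 tokens = 0.8
```

Because TTR decreases with sample size, comparisons are only meaningful across samples of equal token counts, which is why the subsampling configurations matter here.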

8.6 Figure 6

Figure 6. Languages ranked by word-level Entropy (H, x-axis). The Figure convention is the same as in Figure 5.
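
As a pointer to the underlying measure, word-level Shannon entropy can be estimated from the relative frequencies of word types in a token vector. A minimal R sketch (assuming tokenized input; not necessarily the exact implementation used in this script):

```r
# Word-level Shannon entropy (in bits) estimated from the
# relative frequencies of word types in a token vector.
word_entropy <- function(tokens) {
  p <- table(tokens) / length(tokens)  # empirical type probabilities
  -sum(p * log2(p))
}

word_entropy(c("a", "a", "b", "b"))  # two equiprobable types -> 1 bit
```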

8.7 Figure 7

Figure 7. Languages ranked by Measure of Textual Lexical Diversity (MTLD, x-axis). The Figure convention is the same as in Figure 5.
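
For reference, MTLD (McCarthy & Jarvis 2010) counts "factors": stretches of text over which the running TTR stays above a conventional 0.72 threshold, with the trailing partial factor prorated. The published measure averages a forward and a backward pass; the sketch below implements a single forward pass only and is not necessarily the exact code used in this script.

```r
# One-directional MTLD pass (sketch). Traverses the tokens, closing a
# "factor" each time the running TTR of the current stretch drops to or
# below the threshold, then prorating the final partial factor.
# Note: degenerate inputs whose running TTR never drops (e.g. all-unique
# token vectors) yield zero factors and are not handled here.
mtld_pass <- function(tokens, threshold = 0.72) {
  factors <- 0
  start <- 1
  for (i in seq_along(tokens)) {
    run_ttr <- length(unique(tokens[start:i])) / (i - start + 1)
    if (run_ttr <= threshold) {
      factors <- factors + 1   # factor completed, restart the stretch
      start <- i + 1
    }
  }
  if (start <= length(tokens)) {  # prorate the trailing partial factor
    run_ttr <- length(unique(tokens[start:length(tokens)])) /
      (length(tokens) - start + 1)
    factors <- factors + (1 - run_ttr) / (1 - threshold)
  }
  length(tokens) / factors
}
```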

8.8 Figure 8

Figure 8. Languages ranked by Word Information Density (WID, x-axis). The Figure convention is the same as in Figure 5.

8.9 Figure 9

Figure 9. Impact of the sampling configuration on the four complexity indices (TTR: Type-Token Ratio; H: word-level Entropy; MTLD: Measure of Textual Lexical Diversity; WID: Word Information Density). In each panel, the y-axis shows the language ranks according to the sampling configuration (whole set, 5, 10, 20, 40, and 60 subsets, on the x-axis). Languages are displayed in gray when rank is preserved throughout all configurations and in orange when changes occur, with orange edges underlining the changes.

8.10 Figure 10

Figure 10. Distribution of the languages according to their information encoding strategies along the Inter-Word Information and Word Information Density (IWI and WID, respectively) dimensions. Marker convention is shown in Figure 1b (also in Supplementary Information (§8.1)).

8.11 Figure 11

Figure 11. Languages ranked by increasing Language EXplicitness (LEX). Marker convention is shown in Figure 1b (also in Supplementary Information (§8.1)).

8.12 Figure 12

Figure 12. Schematic representation of the three cross-linguistic estimations of Word Information Density (WID) implemented. Left panel: Full Parallel configuration with a common dataset for all languages. Central panel: Pairwise Parallel configuration, with a shared subset for each language and its English translation but different subsets across languages. Right panel: Non-Parallel configuration, with a different subset for each language, including English.

8.13 Figure 13

Figure 13. Comparison of the information density computed on the whole corpus (WID, left column in each panel) and the information densities implemented following Figure 12 (right column in each panel).

8.14 Figure 14

Figure 14. Large panel: comparison between information densities estimated from Non-Parallel and Pairwise-Parallel configurations (20 subsets per language). Small panel: comparison of the observed global correlation on the large panel (red vertical line) and the distribution of correlations obtained from 1000 randomized permutations (histogram in yellow).